The first edition of this book brought together, for the first time, a
comprehensive  collection  of  information  required  by  the  person  developing  software 
for  the  Advanced  Micro  Devices  29K  family  of  RISC  microprocessors  and 
microcontrollers.  This  second  edition  contains  all  the  material  from  the  first.  In 
addition  it  adds  many  new  topics  such  as  performance  evaluation  and  on-chip  cache 
operation.  Topics  such  as  interrupt  processing  and  software  debugging  are  extended 
with  the  addition  of  new  techniques.  The  book  is  useful  to  the  computer  professional 
and  student  interested  in  the  29K  family  RISC  implementation.  It  does  not  assume 
that  the  reader  is  familiar  with  RISC  techniques. 

Although certain members of the 29K family are equally suited to the
construction of a workstation or an embedded application, the material is mainly
applicable to embedded application development. This slant will be appreciated by
most readers, since from early in the 29K’s introduction AMD has promoted the
family as a collection of processors spanning a wide range of embedded performance.
Additionally, in recent years, AMD started a range of microcontrollers, initially with
the Am29200. The inclusion of on-chip peripherals in the microcontroller
implementations resulted in this particular extension to the family being well
received by the embedded processor community.

The  success  of  the  29K  family,  and  of  RISC  technology  in  general,  has  created 
considerable  interest  within  the  microprocessor  industry.  A  growing  number  of 
engineers  are  evaluating  RISC,  and  an  increasing  number  are  selecting  RISC  rather 
than CISC designs for new products. Higher processor performance is the main
reason cited for adopting new RISC designs. This book describes the methods used
by the 29K family, many of which are characteristic of the RISC approach, to
obtain a performance gain vis-a-vis CISC processors. Many of the processor and
software features described will be compared with an equivalent CISC method; this
should assist the engineer making the CISC-to-RISC transition.


Because  the  29K  family  architecture  reveals  the  processor’s  internal  pipeline 
operation  much  more  than  a  CISC  architecture,  a  better  understanding  of  how  the 
software  can  control  the  hardware  and  avoid  resource  conflicts  is  required  to  obtain 
the  best  performance.  Up  to  this  point,  software  engineers  have  had  to  glean 
information  about  programming  the  29K  family  from  scattered  application  notes, 
conference proceedings and other publications. In addition, much of the necessary
information has never been documented. This has led to a number of difficulties,
particularly where the most efficient use of the RISC design features is sought.

The  material  presented  is  practical  rather  than  theoretical.  Each  chapter  is  in  a 
somewhat  standalone  form,  reducing  the  need  to  read  earlier  chapters  before  later 
chapters  are  studied.  Many  of  the  code  examples  are  directly  usable  in  real  embedded 
systems  rather  than  as  student  exercises.  Engineers  planning  on  using  the  29K 
family  will  be  able  to  extract  useful  code  sequences  from  the  book  for  integration  into 
their  own  designs.  Much  of  the  material  presented  has  been  used  by  AMD,  and  other 
independent  companies,  in  building  training  classes  for  computer  professionals 
wishing  to  quickly  gain  an  understanding  of  the  29K  family. 

This  book  is  organized  as  follows: 

Chapter  1  describes  the  architectural  characteristics  of  the  29K  RISC 
microprocessor  and  microcontroller  family.  The  original  family  member,  the 
Am29000  processor,  is  described  first.  Then  the  family  tree  evolution  is  dealt  with  in 
terms  of  each  member’s  particular  features.  Although  all  29K  processors  are 
application  code  compatible  they  are  not  all  pin  compatible.  The  ability  of  the  29K 
family  to  be  flexible  in  its  memory  requirements  is  presented.  In  addition,  the  chapter 
shows  the  importance  of  keeping  the  RISC  pipeline  busy  if  high  performance  is  to  be 
achieved. 

Chapter  2  deals  with  application  programming.  It  covers  the  main  topics 
required  by  a  software  developer  to  produce  code  for  execution  on  a  29K. 
Application  coding  is  done  in  a  high  level  language  and  the  chapter  assumes  the  C 
language  is  most  widely  used.  The  dual  register  and  memory  stack  technique  used  by 
the  29K  procedure  calling-convention  is  described  in  detail,  along  with  the  process 
of  maintaining  the  processor’s  local  register  file  as  a  cache  for  the  top  of  the  register 
stack. Application programs require runtime support. The library services typically
used by developers make demands upon underlying operating system services. The Host
Interface (HIF) specifies a set of operating system services. The HIF services are
described and their relevance put in context.

Chapter 3 explains how to program a 29K at assembly level. Methods of
partitioning and accessing a processor’s register space are described. This includes
the special register space which can only be reached by assembly level instructions.
The reader is shown how to deal with such topics as branch delay slots and memory
access latency. It is not expected that application programs will be developed in
assembly language; rather, assembly language coding skills are required by the
operating system developer. Some developers may only be required to utilize assembly
coding to implement, say, a small interrupt handler routine.

Chapter  4  deals  with  the  complex  subject  of  29K  interrupts.  Because  29K 
processors  make  no  use  of  microcode,  the  range  of  interrupt  handler  options  is 
extended  over  the  typical  CISC  type  processor.  Techniques  new  to  the  reader  familiar 
with  CISC,  such  as  lightweight  interrupts  and  interrupt  context  caching,  are 
presented.  Most  application  developers  are  moving  toward  writing  interrupt 
handlers  in  a  high  level  language,  such  as  C.  This  chapter  describes  the  process  of 
preparing  the  29K  to  handle  a  C  level  signal  handler  after  taking  an  interrupt  or  trap. 

Chapter  5  deals  with  operating  system  issues.  It  describes,  in  detail,  the  process 
of  performing  an  application  task  context  switch.  This  is  one  of  the  major  services 
performed  by  an  operating  system.  A  detailed  knowledge  of  the  utilized 
procedural-linkage  mechanism  and  29K  architectural  features  is  required  to 
implement  a  high  performance  context  switch.  Also  dealt  with  are  issues  concerning 
the  operation  and  maintenance  of  on-chip  instruction  and  data  memory  cache. 

Chapter  6  describes  the  Translation  Look-Aside  Buffer  (TLB)  which  is 
incorporated  into  many  of  the  29K  family  members.  Its  use  as  a  basic  building  block 
for  a  Memory  Management  Unit  (MMU)  is  described.  This  chapter  also 
demonstrates  the  use  of  the  TLB  to  implement  a  software-controlled  cache  which 
improves  overall  system  performance. 

Chapter  7  explains  the  operation  of  popular  software  debugging  tools  such  as 
MiniMON29K  and  GDB.  The  process  of  building  a  debug  environment  for  an 
embedded  application  is  described.  Also  dealt  with  is  the  Universal  Debug  Interface 
(UDI)  which  is  used  to  connect  the  user-interface  process  with  the  process 
controlling  the  target  hardware.  The  use  of  UDI  introduces  new  freedom  in  tool 
choice  to  the  embedded  product  developer. 

Chapter  8  helps  with  the  sometimes  difficult  task  of  processor  selection. 
Performance benchmarks are presented for all the current 29K family members. The
effect that on-chip cache and memory system performance have on overall system
performance is quantified. Systems are considered in terms of their performance and
software programming requirements.

Although I am the sole author of this book, I would like to thank my colleagues
at Advanced Micro Devices for their help with reviewing early manuscripts. I am
also grateful for their thoughtful suggestions, many of which were offered during the
porting of 4.3BSD UNIX to the Am29000 processor. I would also like to thank Grant
Maxwell for his helpful comments and in particular his review of chapters 1, 5 and 8.
Bob Brians also extensively reviewed the first edition and suggested a number of
improvements; he also made many helpful comments when he reviewed the
manuscript for this second edition. Mike Johnson and Steve Guccione reviewed the
section introducing superscalar processors. Chip Freitag reviewed chapter 8 and
helped me improve its quality. Discussions with Leo Lozano helped resolve many of
the issues concerning cache operation dealt with in chapter 5. Thanks also to
Embedded Systems Programming for allowing the use of material describing the
GDB debugger which first appeared in their volume 5 number 12 issue. Embedded
System Engineering is also thanked for allowing the reuse of material describing the
Am29040 processor and Architectural Simulator. Finally, I would like to thank the
Product Marketing Department of AMD’s Embedded Processor Division for their
encouragement to complete this second edition.

Chapter 1


Architectural  Overview 


This chapter deals with a number of topics relevant to the selection of a 29K™
family member. General RISC architecture characteristics are discussed before each
family member is described in more detail. A RISC microprocessor can achieve high
performance only if its pipeline is kept effectively busy; the reasons for this are
explained. Finally, the architectural simulator is described; it is an important tool
in evaluating a processor’s performance.

The instruction set of the 29K family was designed to closely match the internal
representation of operations generated by optimizing compilers. Instruction execution
times are not burdened by redundant instruction formats and options. CISC
microprocessors trap computational sequences in microcode. Microcode is a set of
sequences of internal processor operations combined to perform a machine instruction.
A CISC microprocessor contains an on-chip microprogram memory to hold the
microcode required to support the complex instructions. It is difficult for a compiler
to select CISC instruction sequences which result in the microcode being efficiently
applied to the overall computational task. The myopic microcode results in processor
operational overhead. The compiler for a CISC cannot remove the overhead; it can
only reduce it by making the best selection from the array of instruction options and
formats, such as addressing modes. The compiler for a 29K RISC can exploit lean
instructions whose operation is free of microcode and always visible to the compiler
code-generator.

Each 29K processor has a 4-stage RISC pipeline consisting of a fetch stage
followed by decode, execute and write-back stages. Instructions, with few
exceptions, execute in a single cycle. Although instructions are streamlined, they still
support operations on two source operands, placing the result in a third operand.
Registers are used to supply operands for most instructions, and the processor
contains a large number of registers to reduce the need to fetch data from off-chip
memory. When external memory is accessed it is via explicit load and store
operations, and never via extended instruction addressing modes. The large number
of registers within the processor’s register file acts effectively as a cache for program
data. However, the implementation of a multiport register file is superior to a
conventional data cache as it enables simultaneous access to multiple operands.

Parameter passing between procedure calls is supported by dynamically sized
register windows. Each procedure’s register window is allocated from a stack of 128
32-bit registers. This results in a very efficient procedure call mechanism, and is
responsible for considerable operational benefits compared to the typical CISC
method of pushing and popping procedure parameters from a memory stack.

Processors in the 29K family also make use of other techniques usually
associated with RISC, such as delayed branching, to keep the instruction-hungry
RISC fed and prevent pipeline stalling.

The freedom from microcode not only benefits the effectiveness of the instruction
processing stream, but also benefits the interrupt and trap mechanism required to
support such events as external hardware interrupts. The preparations performed by
29K hardware for interrupt processing are very brief, and this lightweight approach
enables programmers to define their own interrupt architecture, selecting the
optimizations which are best for, say, interrupt throughput, or short latency in
commencing handler processing.

The 29K family includes 3-bus Harvard memory architecture processors,
2-bus processors which have simplified and flexible memory system interfaces, and
microcontrollers with considerable on-chip system support. The range is extensive,
yet User mode instruction compatibility is achieved across the entire family [AMD
1993a]. Within each family-grouping, there is also pin compatibility. The family
supports the construction of a scalable product range with regard to performance and
system cost. For example, all of the performance of the top-end processor
configurations may not be required, or be appropriate, in a product today but it may
be necessary in the future. Because of the range and scalability of the family, making
a commitment to 29K processor technology is an investment supported by the ability
to scale-down or scale-up a design in the future. Many of the family’s advantages are
attained through the flexibility in memory architecture choice. This is significant
because of the important impact a memory system can have on performance, overall
cost, and design and test time [Olson 1988] [Olson 1989].

The microcontroller family members contain all the necessary RAM and ROM
interface glue-logic on-chip, permitting memory devices to be directly connected to
the processor. Given that memory systems need only be 8-bit or 16-bit wide, the
introduction of these devices should hasten the selection of embedded RISC in future
product designs. The use of RISC need not be considered an expensive option in
terms of system cost or hardware and software design times. Selecting RISC is not
only the correct decision for expensive workstation designs, but increasingly for a
wide range of performance and price sensitive embedded products.


1.1  A  RISC  DEFINITION 

The process of dealing with an instruction can be broken down into stages (see
Figure 1-1). An instruction must then flow through the pipeline of stages before its
processing is complete. Independent hardware is used at each pipeline stage.
Information is passed to subsequent pipeline stages at the completion of each
processor cycle. At any instant, the pipeline stages are processing several instructions
which are each at a different stage of completion. Pipelining increases the utilization
of the processor hardware, and effectively reduces the number of processor cycles
required to process an instruction.


[Figure: Instructions #1, #2 and #3 each pass through the fetch, decode, execute and
write-back stages; each instruction starts 1 cycle after the previous one.]

Figure 1-1. RISC Pipeline

With a 4-stage pipeline an instruction takes four cycles to complete, assuming
the pipeline stages are clocked at each processor cycle. However, the processor is
able to start a new instruction at each new processor cycle, and the average
processing time for an instruction is reduced to 1-cycle. Instructions which execute in
1-cycle have only 1-cycle latency as their results are available to the next
instruction.
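
The arithmetic behind this paragraph can be sketched in a few lines of C. The
function name pipeline_cycles is hypothetical and not part of any 29K tool; the
sketch assumes an ideal pipeline in which no stage ever stalls.

```c
#include <assert.h>

/* Total cycles for `count` instructions to pass through an ideal
 * `stages`-deep pipeline with no stalls. The first instruction needs
 * `stages` cycles; each later instruction completes one cycle after
 * its predecessor. Illustrative helper only. */
unsigned long pipeline_cycles(unsigned stages, unsigned long count)
{
    assert(stages > 0);
    if (count == 0)
        return 0;
    return stages + (count - 1);
}
```

Under these assumptions a single instruction on a 4-stage pipeline takes 4 cycles,
while 1000 instructions take 1003 cycles, which is very nearly one cycle per
instruction.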

The 4-stage pipeline of the 29K processor family supports a simplified execute
stage. This is made possible by simplifying instruction formats, limiting instruction
complexity and operating on data held in registers. The simplified execute stage
means that only a single processor cycle is required to complete execute-stage
processing and the cycle time is also minimized.

CISC processors support a complex execution-stage which requires several
processor cycles to complete. When an instruction is ready for execution it is broken
down into a sequence of microinstructions (see Figure 1-2). These simplified
instructions are supplied by the on-chip microprogram memory. Each microinstruction
must be decoded and executed separately before the instruction execution-stage
is complete. Depending on the amount of microcode needed to implement a CISC
instruction, the number of cycles required to complete instruction processing varies
from instruction to instruction.


[Figure: Instructions #1 and #2 each consist of a fetch stage followed by a microcode
program, shown as a sequence of decode (dec) and execute (exe) steps spread over
several 1-cycle intervals (t, t+1, t+2, t+3, ...).]

Figure 1-2. CISC Pipeline

Because the hardware used by the execute-stage of a CISC processor is utilized
for a number of processor cycles, the other stages of the pipeline have additional
cycles available for their own operation. For example, if an execute-stage requires
four processor cycles, the overlapping fetch-stage of the next instruction has four
cycles to complete. If the fetch-stage takes four or fewer cycles, then no stalling of
the pipeline due to execute-stage starvation will occur. Starvation or pipeline stalling
occurs when a previous stage has not completed its processing and cannot pass its
results to the input of the next pipeline stage.
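
The starvation condition just described can be expressed as a small C model. This
is a simplified sketch, not a description of any particular processor: it assumes
that the fetch of the next instruction starts at the same moment as the current
execute-stage, and the name starvation_cycles is hypothetical.

```c
#include <assert.h>

/* Cycles the execute-stage sits idle waiting for the next instruction,
 * assuming its fetch-stage began together with the current execute-stage.
 * Zero means the fetch latency is completely hidden. Illustrative only. */
unsigned starvation_cycles(unsigned fetch_cycles, unsigned execute_cycles)
{
    assert(execute_cycles > 0);
    return fetch_cycles > execute_cycles ? fetch_cycles - execute_cycles : 0;
}
```

With a four-cycle execute-stage, a fetch of four cycles or fewer gives no starvation.
The same model shows why a single-cycle execute-stage is far less forgiving: a
3-cycle fetch then leaves the execute-stage idle for 2 cycles.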

During the evolution of microprocessors, earlier designs operated with slower
memories than are available today. Both processor and memory speeds have seen
great improvements in recent years. However, the low cost of high performance
memory devices now readily available has shifted microprocessor design. When
memory was slow it made sense to overlap multicycle instruction fetch stages with
multicycle execute stages. Once an instruction had been fetched it was worthwhile
getting as much execute-value as possible since the cost of fetching the instruction
was high. This approach drove processor development and led to the name Complex
Instruction Set Computer.

Faster memory means that instruction processing times are no longer fetch-stage
dominated. With a reduction in the number of cycles required by the fetch-stage,
the execute-stage becomes the dominant factor in determining processor
performance. Consequently, attention turned to the effectiveness of the microcode
sequences used to perform CISC instruction execution. Careful analysis of CISC
instruction usage revealed that the simpler instructions were much more frequently
used than the complex ones which required long microcode sequences. The
conclusion drawn was that microcode rarely provides the exact sequence of operations
required to support a high level language instruction.

The variable instruction execution times of CISC instructions result in complex
pipeline management. It is also more difficult for a compiler to work out the
execution times for different combinations of CISC instructions. For that matter, it is
harder for the assembly level programmer to estimate the execution times of, say, an
interrupt handler code sequence compared to the equivalent RISC code sequence.
More importantly, streamlining pipeline operations enables reduced processor cycle
times and greater control by a compiler of the processor’s operation. Given that the
execute-stage dominates performance, the RISC approach is to fetch more
instructions which can be simply executed. Although a RISC program may contain 20%
more instructions than a program for a CISC, the total number of cycles required to
perform a task is reduced.
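
The trade-off in the last sentence is simple arithmetic: total cycles equal the
instruction count multiplied by the average cycles per instruction. The C sketch
below makes the break-even point explicit. The function names are hypothetical,
and any particular CISC cycles-per-instruction figure fed to it is an assumed
input rather than a measurement.

```c
#include <assert.h>

/* Total cycles = instruction count x average cycles per instruction. */
double total_cycles(double instructions, double cycles_per_instruction)
{
    assert(instructions >= 0.0 && cycles_per_instruction > 0.0);
    return instructions * cycles_per_instruction;
}

/* Average CISC cycles per instruction above which the RISC program
 * finishes first. `expansion` is the ratio of RISC to CISC instruction
 * counts (1.2 for "20% more instructions"). Illustrative only. */
double break_even_cisc_cpi(double expansion, double risc_cpi)
{
    assert(expansion > 0.0 && risc_cpi > 0.0);
    return expansion * risc_cpi;
}
```

With a 20% larger program executing at one cycle per instruction, the RISC version
needs fewer cycles whenever the CISC version averages more than 1.2 cycles per
instruction. For instance, 1200 instructions at 1 cycle each beat 1000 instructions
at an assumed 1.5 cycles each (1200 against 1500 cycles).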

A number of processor characteristics have been proposed in the press as
indicative of RISC or CISC. Many of these proposals are made by marketing
departments which wish to control markets by using RISC and CISC labels as marketing
rather than engineering expressions. I consider a processor to be RISC if it is
microcode free and has a simple instruction execute-stage which can complete in a
single cycle.

1.2  FAMILY  MEMBER  FEATURES 

Although this book is about Programming the 29K RISC Family, the following
sections are not restricted to only describing features which can be utilized by
software. They also briefly describe key hardware features which affect a processor’s
performance and hence its selection.

All  members  of  the  family  have  User  mode  binary  code  compatibility.  This 
greatly  simplifies  the  task  of  porting  application  code  from  one  processor  to  another. 
Some  system-mode  code  may  need  to  be  changed  due  to  differences  in  such  things  as 
field  assignments  of  registers  in  special  register  space. 

Given the variation between family members such as the 3-bus Am29050
floating-point processor and the Am29205 microcontroller, it is remarkable that there
is so much software compatibility. The number of family members is expected to
continue to grow; but already there is a wide selection, enabling systems spanning a
range of performance and cost to be constructed (see Figure 1-3). If AMD continues
to grow the family at “both ends of the performance spectrum”, we might expect to
see new microcontroller family members as well as superscalar microprocessors
[Johnson 1991]. AMD has stated that future microprocessors will be pin compatible
with the current 2-bus family members.

I think one of the key features of 29K family members is their ability to operate
with varying memory system configurations. It is possible to build very high
performance Harvard type architectures, or low cost DRAM based systems with high
access latency. Two types of instruction memory caching are supported. Branch
Target Cache (BTC) memory is used in 3-bus family members to hide memory access
latencies. The 2-bus family members make use of more conventional bandwidth
improving instruction cache memory.

Figure 1-3. Processor Price-Performance Summary

The  higher  performance  2-bus  processors  and  microcontrollers  have  on-chip 
data  cache.  When  cache  hit  ratios  are  high,  processing  speeds  can  be  decoupled  from 
memory  system  speeds;  especially  when  the  processor  is  clocked  at  a  higher  speed 
than  the  off-chip  memory  system. 

A second key feature of processors in the 29K family is that the programmer
must supply the interrupt handler save and restore mechanism. Typically a CISC type
processor will save the processor context, when an exception occurs, in accordance
with the on-chip microcode. The 29K family is free of microcode, making the user
free to tailor the interrupt and exception processing mechanism to suit the system.
This often leads to new and more efficient interrupt handling techniques. The fast
interrupt response time and large interrupt handling capacity made possible by the
flexible architecture have been cited as key reasons for selecting a 29K processor
design.

All members of the 29K family make some use of burst-mode memory
interfaces. Burst-mode memory accesses provide a simplified transfer mechanism for
high bandwidth memory systems. Burst-mode addressing only applies to consecutive
access sequences; it is used for all instruction fetches and for load-multiple and
store-multiple data accesses.

The 3-bus microprocessors are dependent on burst-mode addressing to free-up
the address bus after a new instruction fetch sequence has been established. The
memory system is required to supply instructions at sequential addresses without the
processor supplying any further address information, at least until a jump or call type
instruction is executed. This makes the address bus free for use in data memory
access.

The non-3-bus processors cannot simultaneously support instruction fetching
and data access from external memory. Consequently the address bus continually
supplies address information for the instruction or data access currently being
supported by the external memory. However, burst-mode access signals are still
supplied by the processor. Indicating that the processor will require another access
at the next sequential address, after the current access is complete, is an aid in
achieving maximum memory access bandwidth. There are also a number of memory
devices available which are internally organized to give highest performance when
accessed in burst-mode.


1.3 THE Am29000 3-BUS MICROPROCESSOR

The  Am29000™  processor  is  pin  compatible  with  other  3-bus  members  of  the 
family  (see  Table  1-1)  [AMD  1989]  [Johnson  1987].  It  was  the  first  member  of  the 
family,  introduced  in  1987.  It  is  the  core  processor  for  many  later  designs,  such  as  the 
current  2-bus  processor  product  line.  Much  of  this  book  describes  the  operation  of 
the  Am29000  processor  as  the  framework  for  understanding  the  rest  of  the  family. 

The processor can be connected to separate instruction and data memory
systems, thus exploiting the Harvard architectural advantages (see Figure 1-4).
Alternatively, a simplified 2-bus system can be constructed by connecting the data and
instruction busses together; this enables a single memory system to be constructed.
When the full potential of the 3-bus architecture is utilized, it is usually necessary to
include in the memory system a bridge to enable instruction memory to be accessed
as data. The processor does not support any on-chip means to transfer information on
the instruction bus to the data bus.

The load and store instructions, used for all external memory access, have an
option field (OPT2-0) which is presented to device pins during the data transfer
operation. Option field value OPT=4 is defined to indicate the bridge should permit
ROM space to be read as if it were data. Instructions can be located in two separate
spaces: Instruction space and ROM space. Often these spaces become the same, as
the IREQT pin (instruction request type) is not decoded so as to enable distinction
between the two spaces. When ROM and Instruction spaces are not common, a range
of data memory space can be set aside for accessing Instruction space via the bridge.
It is best to avoid overlapping external address spaces if high level code is to access
any memory located in the overlapping regions (see section 1.10.4).

Table 1-1. Pin Compatible 3-bus 29K Family Processors

Processor                            Am29000               Am29050                   Am29005

Instruction Cache                    BTC 32x4 words        BTC 64x4 or 128x2 words   No
I-Cache Associativity                2 Way                 2 Way                     N/A
Data Cache                           -                     -                         -
D-Cache Associativity                -                     -                         -
On-Chip Floating-Point               No                    Yes                       No
On-Chip MMU                          Yes                   Yes                       No
Integer Multiply in h/w              No                    Yes                       No
Programmable Bus Sizing              No                    No                        No
On-Chip Interrupt Controller Inputs  Yes, 6                Yes, 6                    Yes, 6
Scalable Bus Clocking                No                    No                        No
Burst-mode Addressing                Yes, up to 1K bytes   Yes, up to 1K bytes       Yes, up to 1K bytes
Freeze Mode Processing               Yes                   Yes                       Yes
Delayed Branching                    Yes                   Yes                       Yes
On-Chip Timer                        Yes                   Yes                       Yes
On-Chip Memory Controller            No                    No                        No
DMA Channels                         -                     -                         -
Byte Endian                          Big/Little            Big/Little                Big/Little
JTAG Debugging                       No                    No                        No
Clock Speeds (MHz)                   16,20,25,33           20,25,33,40               16

Figure 1-4. Am29000 Processor 3-bus Harvard Memory System

All processors in the 29K family support byte and half-word size read and write
access to data memory. The original Am29000 (pre rev-D, 1990) only supported
word sized data access. This resulted in read-modify-write cycles to modify
sub-word sized objects. The processor supports insert- and extract-byte and half-word
instructions to assist with sub-word operations. These instructions are little used
today.

The processor has a Branch Target Cache (BTC) memory which is used to supply
the first four instructions of previously taken branches. Successful (taken) branches
make up about 20% of a typical instruction mix. Using burst-mode and interleaving
techniques, memory systems can sustain the high bandwidths required to keep the
instruction hungry RISC fed. However, when a branch occurs, memory systems can
present considerable latency before supplying the first instruction of the branch
target. For example, consider an instruction memory system which has a 3-cycle first
access latency but can sustain 1-cycle access in burst-mode. Typically every 5th
instruction is a branch, and for the example the branch instruction would effectively
take 5 cycles to complete its execution (the pipeline would be stalled for 4 cycles;
see section 1.13). If all other instructions were executed in a single cycle, the
average number of cycles per instruction would be 1.8 (i.e. 9/5), not the desired
sustained single-cycle operation. The BTC can hide all 3 cycles of memory access
latency, and enable the branch instruction to execute in a single cycle.
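
The arithmetic in the example above can be checked with a short sketch. The
function and parameter names are illustrative only and are not taken from any AMD
tool; the model simply assumes that one instruction in every five is a taken branch
and that all other instructions take one cycle.

```python
def average_cpi(branch_interval, branch_cycles, other_cycles=1):
    """Average cycles per instruction when one instruction in every
    `branch_interval` is a taken branch costing `branch_cycles` cycles."""
    total_cycles = (branch_interval - 1) * other_cycles + branch_cycles
    return total_cycles / branch_interval


# Branch costs 5 cycles when the target must come from external memory.
without_btc = average_cpi(branch_interval=5, branch_cycles=5)

# Branch costs 1 cycle when the BTC supplies the target instructions.
with_btc = average_cpi(branch_interval=5, branch_cycles=1)
```

Here `without_btc` reproduces the 9/5 figure quoted in the text, and `with_btc`
corresponds to sustained single-cycle operation.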

The programmer has little control over BTC operation; it is maintained internally
by processor hardware. There are 32 cache entries (known as cache blocks) of four
instructions each. They are configured in a 2-way set associative arrangement.
Entries are tagged to distinguish between accesses made in User mode and Supervisor
mode; they are also tagged to differentiate between virtual addresses and physical
addresses. Because the address in the program counter is presented to the BTC at the
same time it is presented to the MMU, the BTC does not operate with physical
addresses. Entries are not tagged with per-process identifiers; consequently the BTC
cannot distinguish between identical virtual addresses belonging to different
processes operating with virtual addressing. Systems which operate with multiple
tasks using virtual addressing must invalidate the cache when a user-task context
switch occurs. Using the IRETINV (interrupt return and invalidate) instruction is one
convenient way of doing this.
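
The need for invalidation can be illustrated with a toy model. This is only a
sketch: the class and method names are hypothetical, capacity and associativity are
ignored, and the only property modelled is that entries are tagged with the address,
the User/Supervisor mode, and the virtual/physical flag, but not with a process
identifier.

```python
class BranchTargetCache:
    """Toy BTC: entries keyed by (address, supervisor, virtual); no process ID."""

    def __init__(self):
        self.entries = {}

    def lookup(self, address, supervisor, virtual):
        # Returns the cached target instructions, or None on a miss.
        return self.entries.get((address, supervisor, virtual))

    def fill(self, address, supervisor, virtual, instructions):
        self.entries[(address, supervisor, virtual)] = tuple(instructions)

    def invalidate(self):
        # Models the effect of invalidating the whole cache (e.g. via IRETINV).
        self.entries.clear()


btc = BranchTargetCache()

# Task A takes a branch to virtual address 0x1000 in User mode.
btc.fill(0x1000, supervisor=False, virtual=True,
         instructions=["A0", "A1", "A2", "A3"])

# Task B, using the same virtual address, would hit task A's entry.
stale = btc.lookup(0x1000, supervisor=False, virtual=True)

# Invalidating on the context switch removes the stale entry.
btc.invalidate()
after_switch = btc.lookup(0x1000, supervisor=False, virtual=True)
```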

The BTC is able to hold the instructions of frequently taken trap handler
routines, but there is no means to lock code sequences into the cache. Entries are
replaced in the cache on a random basis, the most recently occurring branches
replacing the current entries when necessary.

The 3-bus members of the 29K family can operate the shared address bus in
a pipeline mode. If a memory system is able to latch an address before an instruction
or data transfer is complete, the address bus can be freed to start a subsequent access.
Allowing two accesses to be in progress simultaneously can be effectively used by
the separate instruction and data memory systems of a Harvard architecture.

1.3.1  The  Am29005 

The Am29005™ is pin compatible with other 3-bus members of the family (see
Table 1-1). It is an inexpensive version of the Am29000 processor. The Translation
Look-Aside Buffer (TLB) and the Branch Target Cache (BTC) have been omitted. It
is available at a lower clock speed, and only in the less expensive plastic packaging. It
is a good choice for systems which are price sensitive and do not require Memory
Management Unit support or the performance advantages of the BTC. An Am29005
design can always be easily upgraded with an Am29000 replacement later. In fact the
superior debugging environment offered by the Am29000 or the Am29050™ may
make the use of one of these processors a good choice during software debugging. The
faster processor can always be replaced by an Am29005 when production commences.

1.4 THE Am29050 3-BUS FLOATING-POINT MICROPROCESSOR

The Am29050 processor is pin compatible with other 3-bus members of the
family (see Table 1-1) [AMD 1991a]. Many of the features of the Am29050 were
already covered in the section describing its close relative, the Am29000.
The Am29050 processor offers a number of additional performance and system
support features when compared with the Am29000. The most notable is the direct
execution of double-precision (64-bit) and single-precision (32-bit) floating-point
arithmetic on-chip. The Am29000 has to rely on software emulation or the
Am29027™ floating-point coprocessor to perform floating-point operations. The
introduction of the Am29050 eliminated the need to design the Am29027
coprocessor into floating-point intensive systems.

The processor contains a Branch Target Cache (BTC) memory system like the
Am29000; but this time it is twice as big, with 32 entries in each of the two sets rather
than the Am29000’s 16 entries per set. BTC entries are not restricted to four
instructions per entry; there is an option (bit CO in the CFG register) to arrange the
BTC as 64 entries per set, with each entry containing two instructions rather than four.
The smaller entry size is more useful with lower latency memory systems. For example,
if a memory system has a 2-cycle first-access start-up latency it is more efficient to
have a larger number of 2-instruction entries. After all, for this example system, the
third and fourth instructions in a four per entry arrangement could just as efficiently
be fetched from the external memory.
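
The trade-off between the two arrangements can be made concrete with a small
calculation. The helper names below are illustrative; the figures (two sets, 32
four-instruction entries or 64 two-instruction entries per set) are those given in
the text.

```python
SETS = 2  # the BTC is 2-way set associative


def instructions_held(entries_per_set, instructions_per_entry):
    """Total number of instructions the BTC can hold."""
    return SETS * entries_per_set * instructions_per_entry


def branch_targets_held(entries_per_set):
    """Number of distinct branch targets that can be cached at once."""
    return SETS * entries_per_set


four_per_entry = (instructions_held(32, 4), branch_targets_held(32))
two_per_entry = (instructions_held(64, 2), branch_targets_held(64))
```

Both arrangements hold the same number of instructions, but the two-instruction
arrangement caches twice as many branch targets, which is why it suits memory
systems whose start-up latency is covered by two instructions.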

The Am29050 also incorporates an Instruction Forwarding path which additionally
helps to reduce the effects of instruction memory access latency. When a new
instruction fetch sequence commences, and the target of the sequence is not found in
the BTC, an external memory access is performed to start filling the Instruction
Prefetch Buffer (IPB). With the Am29000 processor the fetch stage of the processor
pipeline is fed from the IPB, but the Am29050 can by-pass the fetch stage and feed
the first instruction directly into the decode pipeline stage using the instruction
forwarding technique. By-passing also enables up to four cycles of external memory
latency to be hidden when a BTC hit occurs (see section 1.10).

The Am29050 incorporates a Translation Look-Aside Buffer (TLB) for
Memory Management Unit support, just like the Am29000 processor. However, it
also has two region mapping registers. These permit large areas of memory to be
mapped without using up the smaller TLB entries. They are very useful for mapping
large data memory regions, and their use reduces the TLB software management
overhead.

The processor can also speed up data memory accesses by making the access
address available a cycle earlier than the Am29000. The method is used to reduce
the latency of memory load operations, which have a greater influence on pipeline
stalling than store operations. Normally the address of a load appears on the address
bus at the start of the cycle following the execution of the load instruction. If virtual
addressing is in use, then the TLB registers are used to perform address translation
during the second half of the load execute-cycle. To save a cycle, the Am29050 must
make the physical address of the load available at the start of the load instruction
execution. It has two ways of doing this.

The access address of a load instruction is specified by the RB field of the
instruction (see Figure 1-13). A 4-entry Physical Address Cache (PAC) memory is
used to store the most recent load addresses. The cache entries are tagged with RB
field register numbers. When a load instruction enters the decode stage of the pipeline,
the RB field is compared with one of the PAC entries, using a direct mapping technique,
with the lower 2 bits of the register number being used to select the PAC entry. When
a match occurs the PAC supplies the address of the load, thus avoiding the delay of
reading the register file to obtain the address from the register selected by the RB field
of the LOAD instruction. If a PAC miss occurs, the new physical address is written to
the appropriate PAC entry. The user has no means of controlling the PAC; its
operation is completely determined by the processor hardware.
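
The direct mapping described above can be sketched as follows. This is a toy model
under stated assumptions, not a description of the actual hardware: the class and
method names are hypothetical, and the model does not attempt to show how entries
are kept consistent when the register selected by RB is later modified, since the
text does not describe that.

```python
class PhysicalAddressCache:
    """Toy 4-entry, direct-mapped PAC tagged with RB register numbers."""

    ENTRIES = 4

    def __init__(self):
        self.tags = [None] * self.ENTRIES
        self.addresses = [None] * self.ENTRIES

    def lookup(self, rb):
        index = rb & 0x3              # lower 2 bits of the register number
        if self.tags[index] == rb:
            return self.addresses[index]
        return None                   # PAC miss

    def fill(self, rb, physical_address):
        index = rb & 0x3
        self.tags[index] = rb         # replaces whatever occupied this entry
        self.addresses[index] = physical_address


pac = PhysicalAddressCache()
first = pac.lookup(96)                # miss: nothing cached yet
pac.fill(96, 0x80000040)
second = pac.lookup(96)               # hit: same register number
conflict = pac.lookup(100)            # 100 and 96 select the same entry: miss
pac.fill(100, 0x80001000)
evicted = pac.lookup(96)              # register 96 was displaced by register 100
```

Because the mapping is direct, registers whose numbers share the same lower two bits
compete for a single entry.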

The second method used by the Am29050 processor to reduce the effect of pipeline
stalling occurring as a result of memory load latency is the Early Address Generator
(EAG). Load addresses are frequently formed by preceding the load with
CONST, CONSTH and ADD type instructions. These instructions prepare a general
purpose register with the address about to be used during the load. The EAG circuitry
continually generates addresses formed by the use of the above instructions in the
hope that a load instruction will immediately follow and use the address newly
formed by the preceding instructions. The EAG must make use of the TLB address
translation hardware in order to make the physical address available at the start of the
load instruction. This happens when, fortunately, the RB field of the load instruction
matches with the destination register of the previous address computation
instructions.

Software debugging is better supported on the Am29050 processor than on any
other current 29K family member. All 29K processors have a trace facility which
enables single stepping of processor instructions. However, prior to the Am29050
processor, tracing did not apply to the processor operation while the DA bit (disable all
traps and interrupts) was set in the current processor status (CPS) register. The DA bit
is typically set while the processor is operating in Freeze mode (FZ bit set in the CPS
register). Freeze mode code is used during the entry and exit of interrupt and trap
handlers, as well as other critical system support code. The introduction of Monitor
mode operation with the Am29050 enables tracing to be extended to Freeze mode
code debugging. The processor enters Monitor mode when a synchronous trap
occurs while the DA bit is set. The processor is equipped with a second set of PC buffer
registers, known as the shadow PC registers, which record the PC-bus activity while
the processor is operating in Monitor mode. The first set of PC buffer registers has
its values frozen when Freeze mode is entered.

The addition of two hardware breakpoint registers aids the Am29050 debug
support. As instructions move into the execute stage of the processor pipeline, the
instruction address is compared with the break address values. The processor takes a
trap when a match occurs. Software debug tools, such as monitors like
MiniMON29K, used with other 29K family members, typically use illegal instructions
to implement breakpoints. The use of breakpoint registers has a number of advantages
over this technique. Breakpoints can be placed in read-only memories, and break
addresses need not be physical; they can be virtual, tagged with the per-process
identifier.


1.5 THE Am29030 2-BUS MICROPROCESSOR

The Am29030™ processor is pin compatible with other 2-bus members of the
family (see Table 1-2) [AMD 1991b]. It was the first member of the 2-bus family,
introduced in 1991. Higher device construction densities enable it to offer high
performance with a simplified system interface design. From a software point of view
the main differences between it and the Am29000 processor occur as a result of
replacing the Branch Target Cache (BTC) memory with 8k bytes of instruction cache,
and connecting the instruction and data busses together on-chip. However, the
system interface busses have gained a number of important new capabilities.

The inclusion of an instruction cache memory reduces off-chip instruction
memory access bandwidth requirements. This enables instructions to be fetched via
the same device pins used by the data bus. Only when instructions cannot be supplied
by the cache is there contention for access to external memory. Research [Hill 1987]
has shown that with cache sizes above 4k bytes, a conventional instruction cache is
more effective than a BTC. At these cache sizes the bandwidth requirements are
sufficiently reduced as to make a shared instruction/data bus practicable.

Table 1-2. Pin Compatible 2-bus 29K Family Processors

Processor                     Am29030                  Am29035                  Am29040
Instruction Cache             8K bytes                 4K bytes                 8K bytes
I-Cache Associativity         2-Way                    Direct-Mapped            2-Way
Data Cache (Physical)         -                        -                        4K bytes
D-Cache Associativity         -                        -                        2-Way
On-Chip Floating-Point        No                       No                       No
On-Chip MMU                   Yes                      Yes                      Yes
Integer Multiply in h/w       No                       No                       Yes, 2-cycles
Narrow Memory Reads           Yes, 8/16 bit            Yes, 8/16 bit            Yes, 8/16 bit
Programmable Bus Sizing       No                       Yes, 16/32 bit           Yes, 16/32 bit
On-Chip Interrupt Controller  Yes                      Yes                      Yes
Interrupt Controller Inputs   6                        6                        6
Scalable Clocking             1x,2x                    1x,2x                    1x,2x
Burst-mode Addressing         Yes, up to 1K bytes      Yes, up to 1K bytes      Yes, up to 1K bytes
Freeze Mode Processing        Yes                      Yes                      Yes
Delayed Branching             Yes                      Yes                      Yes
On-Chip Timer                 Yes                      Yes                      Yes
On-Chip Memory Controller     No                       No                       No
DMA Channels                  -                        -                        -
Byte Endian                   Big/Little               Big/Little               Big/Little
JTAG Debugging                Yes                      Yes                      Yes
Clock Speeds (MHz)            20,25,33                 16                       0-33,40,50

Each cache entry (known as a block) contains four consecutive instructions.
They are tagged in a similar manner to the BTC mechanism of the Am29000
processor. This allows cache entries to be used for both User mode and Supervisor mode
code at the same time, and entries to remain valid during application system calls and
system interrupt handlers. However, since entries are not tagged with per-process
identifiers, the cache entries must be invalidated when a task context switch occurs.
The cache is 2-way set associative. The 4k bytes of instruction cache provided by
each set results in 256 entries per set (each entry being four instructions, i.e. 16 bytes).
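
The cache geometry follows directly from the figures above. The sketch below
derives the number of entries per set and shows one plausible way an address could
be split into tag, entry index and byte offset. The split is an assumption made for
illustration (the low-order bits above the block offset select the entry); the text
does not state which address bits the hardware actually uses.

```python
CACHE_BYTES = 8 * 1024   # total instruction cache size
SETS = 2                 # 2-way set associative
BLOCK_BYTES = 16         # four 4-byte instructions per block

BYTES_PER_SET = CACHE_BYTES // SETS
ENTRIES_PER_SET = BYTES_PER_SET // BLOCK_BYTES


def split_address(address):
    """Split an address into (tag, entry index, byte offset within block)."""
    offset = address % BLOCK_BYTES
    index = (address // BLOCK_BYTES) % ENTRIES_PER_SET
    tag = address // (BLOCK_BYTES * ENTRIES_PER_SET)
    return tag, index, offset


example = split_address(0x12345678)
```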

When a branch instruction is executed and the block containing the target
instruction sequence is not found in the cache, the processor fetches the missing
block and marks it valid. Complete blocks are always fetched, even if the target
instruction lies at the end of the block. However, the cache forwards instructions to
the decoder without waiting for the block to be reloaded. If the cache is not disabled
and the block to be replaced in the cache is not valid-and-locked, then the fetched
block is placed in the cache. The 2-way cache associativity provides two possible
cache blocks for storing any selected memory block. When a cache miss occurs, and
both associated blocks are valid but not locked, a block is chosen at random for
replacement.
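
The replacement rules can be summarized in a short sketch. The function and field
names are hypothetical, and one detail is an assumption not stated in the text: when
one of the two candidate blocks is invalid, the sketch prefers it over a valid block.

```python
import random


def choose_victim(blocks, rng=random):
    """Pick which of the two candidate blocks to replace on a cache miss.

    `blocks` is a list of two dicts with boolean 'valid' and 'locked' fields.
    Returns the index of the block to replace, or None when both candidates
    are valid-and-locked, in which case the fetched block is not cached.
    """
    replaceable = [i for i, b in enumerate(blocks)
                   if not (b["valid"] and b["locked"])]
    if not replaceable:
        return None
    invalid = [i for i in replaceable if not blocks[i]["valid"]]
    if invalid:
        return invalid[0]             # assumption: fill an invalid block first
    return rng.choice(replaceable)    # valid, unlocked blocks: random choice


both_locked = choose_victim([{"valid": True, "locked": True},
                             {"valid": True, "locked": True}])
one_invalid = choose_victim([{"valid": True, "locked": False},
                             {"valid": False, "locked": False}])
one_locked = choose_victim([{"valid": True, "locked": True},
                            {"valid": True, "locked": False}])
random_pick = choose_victim([{"valid": True, "locked": False},
                             {"valid": True, "locked": False}])
```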

Locking valid blocks into the cache is not provided for on a per-block basis but
in terms of the complete cache or one set of the two sets. When a set is locked, valid
blocks are not replaced; invalid blocks will be replaced and marked valid and locked.
Cache locking can be used to preload the cache with instruction sequences critical to
performance. However, it is often difficult to use cache locking in a way that can
outperform the supported random replacement algorithm.

The processor supports Scalable Clocking™ which enables the processor to
operate at the same or twice the speed of the off-chip memory system. A 33 MHz
processor could be built around a 16.5 MHz memory system, and depending on cache
utilization there may be little drop-off in performance compared to having constructed
a 33 MHz memory system. This provides for higher system performance without
increasing memory system costs or design complexity. Additionally, a performance
upgrade path is provided for systems which were originally built to operate at lower
speeds. The processor need merely be replaced by a pin-compatible higher frequency
device (at higher cost) to realize improved system performance.

Memory system design is further simplified by enforcing a 2-cycle minimum
access time for data and instruction accesses. Even if 1-cycle burst-mode is
supported by a memory system, the first access in the burst is hardwired by the processor
to take 2 cycles. This is effective in relaxing memory system timing constraints and
generally appreciated by memory system designers. The high frequency operation of
the Am29030 processor can easily result in electrical noise [AMD 1992c]. Enforcing
2-cycle minimum access times ensures that the address bus has more time to settle
before the data bus is driven. This reduces system noise compared with the data bus
changing state during the same cycle as the address bus.

At high processor clock rates, it is likely that an interleaved memory system will
be required to obtain bandwidths able to sustain 1-cycle burst mode access.
Interleaving requires the construction of two, four or more memory systems (known as
banks), which are used in sequence. When accessed in burst-mode, each bank is
given more time to provide access to its next storage location. The processor provides
an input pin, EARLYA (early address), by which a memory system can request early
address generation by the processor. This can be used to simplify the implementation
of interleaved memory systems. When requested, the processor provides early the
address of even-addressed banks, allowing the memory system to begin early accesses
to both even- and odd-addressed banks.
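
The way a burst cycles through banks can be sketched as below. The mapping is an
assumption made for illustration: banks are taken to be interleaved on 32-bit word
boundaries, with the low-order bits of the word address selecting the bank. A real
memory system may be organized differently.

```python
WORD_BYTES = 4  # 32-bit instruction/data words


def bank_for(address, banks=2):
    """Bank selected by a byte address in a word-interleaved memory system."""
    return (address // WORD_BYTES) % banks


# Eight consecutive word addresses, as generated during a burst.
burst = [0x1000 + WORD_BYTES * i for i in range(8)]

two_way = [bank_for(a, banks=2) for a in burst]
four_way = [bank_for(a, banks=4) for a in burst]
```

With two banks each bank is accessed on alternate cycles of the burst, and with four
banks on every fourth cycle, which is what gives each bank more time per access.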

The processor can operate with memory devices which are not the full 32-bit
width of the data bus. This is achieved using the Narrow Read capability. Memory
systems which are only 8-bit or 16-bit wide are connected to the upper bits of the
data/instruction bus. They assert the RDN (read narrow) input pin along with the
RDY (ready) pin when responding to access requests. When this occurs the processor
will automatically perform the necessary sequences of accesses to assemble
instructions or data which are bigger than the memory system width.
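
The sequence of narrow accesses can be modelled in software as follows. This is
only a sketch of the idea: the function name and the `read_unit` callback are
hypothetical, and big-endian ordering is assumed purely for the example (the
processors support both byte orders).

```python
def narrow_read_word(read_unit, address, width_bits):
    """Assemble a 32-bit value from 8-bit or 16-bit wide memory.

    `read_unit(addr)` returns one memory-width unit located at `addr`.
    Units are combined most significant first (big-endian assumption).
    """
    if width_bits not in (8, 16):
        raise ValueError("narrow memory must be 8 or 16 bits wide")
    unit_bytes = width_bits // 8
    value = 0
    for i in range(4 // unit_bytes):      # 4 accesses for 8-bit, 2 for 16-bit
        value = (value << width_bits) | read_unit(address + i * unit_bytes)
    return value


rom = bytes([0x70, 0x40, 0x01, 0x01])

# Four byte-wide accesses to an 8-bit ROM.
from_8_bit = narrow_read_word(lambda a: rom[a], 0, 8)

# Two half-word accesses to a 16-bit memory holding the same contents.
from_16_bit = narrow_read_word(
    lambda a: int.from_bytes(rom[a:a + 2], "big"), 0, 16)
```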

The Narrow Read ability cannot be used for data writing. However, it is very
useful for interfacing to ROM which contains system boot-up code. Only a single
8-bit ROM may be required to contain all the necessary system initialization code.
This can greatly simplify system design and reduce board space and cost. The ROM
can be used to initialize system RAM memory which, due to its 32-bit width, will
permit faster execution.

1.5.1  Am29030  Evaluation. 

AMD provides a low cost evaluation board for the Am29030 at 16 MHz, known
as the EZ030 (pronounced easy-030). Like the microcontroller evaluation board, it is
standalone, requiring an external 5 V power supply and connection to a remote
computer via an RS-232 connection. The board is very small, measuring about 4 inches
by 4 inches (10x10 cm). The memory system is restricted to 16 MHz operation but
with scalable clocking the processor can run at 16 MHz or 33 MHz.

It contains 128k bytes of EPROM, which is accessed via 8-bit narrow bus
protocol. There is also 1M byte of DRAM arranged as 256kx32 bits. The DRAM is
expandable to 4M bytes. The EPROM is preprogrammed with the MiniMON29K
debug monitor and the OS-boot operating system described in Chapter 7.

1.5.2  The  Am29035 

The Am29035™ processor is pin compatible with other 2-bus members of the
family (see Table 1-2). As would be expected, given the AMD product number, its
operation is very similar to the Am29030 processor. It is only available at lower clock
frequencies, compared with its close relative. It also has half the amount of instruction
cache memory, containing one set of the two sets provided by the Am29030. That is, it
has 4k bytes of instruction memory cache which is directly mapped. Consequently it
can be expected to operate with reduced overall performance.

In all other aspects it is the same as the Am29030 processor, except it has
Programmable Bus Sizing which the Am29030 processor does not. Programmable Bus
Sizing provides for lower cost system designs. The processor can be dynamically
programmed (via the configuration register) to operate with a 16-bit instruction/data
bus, performing both read and write operations. When the option is selected, 32-bit
data is accessed by the processor hardware automatically performing two
consecutive accesses. The ability to operate with 16-bit and 32-bit memory systems makes
the 2-bus 29K family members well suited to scalable system designs, in terms of
cost and performance.


1.6 THE Am29040 2-BUS MICROPROCESSOR

The Am29040™ processor is pin compatible with other 2-bus members of the
family (see Table 1-2). The processor was introduced in 1994 and offers higher
performance than the 2-bus Am29030; it also has a number of additional system support
facilities.

There is an enhanced instruction cache of 8k bytes, which is tagged in much
the same way as the Am29030’s instruction cache, except that there are four valid bits
per cache block (compared to the Am29030’s one bit per block). Partially filled blocks
are supported, and block reload begins with the first required instruction (target of a
branch) rather than the first instruction in the block. An additional benefit of having a
valid bit per instruction rather than per block is that load or store instructions can
interrupt cache reload. With the Am29030 processor, once cache reload had started,
it could not be postponed or interrupted by a higher priority LOAD instruction.

The Am29040 was the first 29K microprocessor to have a data cache. The 4k
byte data cache is physically addressed and supports both “copy-back” and
“write-through” policies. As with other 29K family members, the data cache always
operates with physical addresses, and cache blocks are only allocated on LOAD
instructions which miss (a “read-allocate” or “load-allocate” policy). The block size is
16 bytes and there is one valid bit per block. This means that complete data blocks must
be fetched when data cache reload occurs. Burst mode addressing is used to reload a
block, starting with the first word in the block. The addition of a data cache makes the
Am29040 particularly well-suited to high-performance data handling applications.

The default data cache policy is “copy-back”. A four word copy-back buffer is
used to improve the performance of the copy-back operation. Additionally, cache
blocks have an M bit-field, which becomes set when data in the block is modified. If
the M bit is not set when a cache block is reallocated, the out-going block is not
copied back.
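
The interaction of the read-allocate policy with the M bit can be illustrated with
a toy model. It is deliberately simplified and is not a description of the hardware:
the class is hypothetical, each index holds a single block (the real cache is 2-way
set associative), and the copy-back buffer is not modelled.

```python
BLOCK_BYTES = 16


class CopyBackCache:
    """Toy read-allocate, copy-back data cache with one block per index."""

    def __init__(self, blocks=4):
        self.blocks = blocks
        self.lines = {}        # index -> {"tag": int, "modified": bool}
        self.copy_backs = 0    # number of blocks written back to memory

    def _locate(self, address):
        block = address // BLOCK_BYTES
        return block % self.blocks, block // self.blocks

    def load(self, address):
        index, tag = self._locate(address)
        line = self.lines.get(index)
        if line and line["tag"] == tag:
            return "hit"
        if line and line["modified"]:
            self.copy_backs += 1          # out-going block has its M bit set
        self.lines[index] = {"tag": tag, "modified": False}
        return "miss"                     # block allocated on a load miss

    def store(self, address):
        index, tag = self._locate(address)
        line = self.lines.get(index)
        if line and line["tag"] == tag:
            line["modified"] = True       # M bit becomes set
            return "hit"
        return "miss"                     # stores that miss do not allocate


cache = CopyBackCache()
results = [
    cache.store(0x00),   # miss: no block is allocated by a store
    cache.load(0x00),    # miss: block allocated, M bit clear
    cache.store(0x04),   # hit: same block, M bit set
    cache.load(0x40),    # miss: maps to the same index, block copied back
    cache.load(0x00),    # miss: out-going block is unmodified, no copy-back
]
```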

When data cache is added to a processor, there can be difficulties dealing with
data consistency. Problems arise when there is more than one processor or data
controller (such as a DMA controller) accessing the same memory region. The Am29040
processor uses bus snooping to solve this problem. The method relies on the
processor monitoring all accesses performed on the memory system. The processor
intervenes or updates its cache when an access is attempted on a currently cached data
value. Cache consistency is dealt with in detail in section 5.14.4.

Via the MMU, each memory page can be separately marked as “non cached”,
“copy-back”, or “write-through”. A two word write-through buffer is used to assist
with writes to memory. It enables multiple store instructions to be in-execution
without the processor pipeline stalling. Data accesses which hit in the cache require
2-cycle access times. Two cycles, rather than one, are required due to the potentially
high internal clock speed. However, load instructions do not cause pipeline stalling
if the instruction immediately following the load does not require the data being
accessed. The data cache operation is explained in detail in section 5.14.2.

Scalable bus clocking is supported, enabling the processor to run at twice the
speed of the off-chip memory system. Scalable Clocking™ was first introduced with
the Am29030 processors, and is described in the previous section covering the
Am29030. If cache hit rates are sufficiently high, Scalable Clocking enables high
performance systems to be built around relatively slow memory systems. It also
offers an excellent upgrade path when additional performance is required in the future.
The maximum on-chip clock speed is 50 MHz.

The Am29040 processor supports integer multiply directly. A latency of two
cycles applies to integer multiply instructions (most 29K instructions require only
one cycle). Again, this is a result of the potentially high internal clocking speeds of
the processor. Most 29K processors take a trap when an integer multiply is attempted.
It is left to trapware to emulate the missing instruction. The ability to perform high
speed multiply makes the processor a better choice for calculation intensive
applications such as digital signal processing. Note, floating-point performance should
also improve with the Am29040 as floating-point emulation routines can make use of the
integer multiply instruction.

The Am29040 has two Translation Look-Aside Buffers (TLBs). Having two
TLBs enables a larger number of virtual to physical address translations to be cached
(held in a TLB register) at any time. This reduces the TLB reload overhead. The TLB
format is similar to the arrangement used with the Am29243 microcontroller. Each
TLB has 16 entries (8 sets, two entries per set). The page size used by each TLB can
be the same or different. If the TLB page sizes are the same, a four-way set
associative MMU can be constructed with supporting software. Alternatively one TLB can
be used for code and the second, with a larger page size, for data buffers or shared
libraries. The TLB entries have a Global Page (GLB) bit; when set the mapped page
can be accessed by any process regardless of its process identifier (PID). The TLB
also allows parity checking to be enabled on a per page basis, and pages can be
allocated from 16-bit or 32-bit wide memory regions.

On-chip debug support is extended with the inclusion of two Instruction
Breakpoint Controllers and one Data Breakpoint Controller. This enables inexpensive
debug monitors such as the DebugCore incorporated within MiniMON29K to be used
when developing software. Breakpoints are supported when physical or virtual
addressing is in use. The JTAG test interface has also been extended over other 29K
family members to include several new JTAG-processed instructions. The
effectiveness of the JTAG interface for hardware and software debugging is improved.

The Am29040 family grouping is implemented with a silicon process which
enables processors to operate at 3.3 volts. However, the device is tolerant of 5-volt
input/output signal levels. The lower power consumption achievable at 3.3 volts
makes the Am29040 suitable for hand-held type applications. Note, the device
operates at a maximum clock frequency of 50 MHz.

A 29K processor enters Wait Mode when the Wait Mode bit is set in the Current
Processor Status (CPS) register. Wait Mode is extended to include a Snooze Mode
which is entered from Wait Mode while the interrupt and trap input lines are held
inactive. An interrupt is normally used to depart Wait or Snooze Mode. While in
Snooze Mode, Am29040 power consumption is reduced. Returning from Snooze
Mode to an interrupt processing state requires approximately 256 cycles. The
processor can be prevented from entering Snooze Mode while in Wait Mode by holding,
for example, the INTR3 input pin active and setting the interrupt mask so as to disable
the INTR3 interrupt.

If the input clock is held high or low while the processor is in Snooze Mode,
Sleep Mode is entered. Minimum power consumption occurs in this mode. The processor
returns to Snooze Mode when the input clock is restarted. Using Snooze and
Sleep Modes enables the Am29040 processor to be used in applications which are
very power sensitive.
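
The mode transitions described above can be summarised as a small state machine. The following Python sketch is only an illustration of the text, not AMD's logic: the names Mode and next_mode, and the boolean inputs, are hypothetical, and the "run" state stands in for normal execution or interrupt processing. It also simplifies by ignoring the roughly 256-cycle delay when leaving Snooze Mode.

```python
from enum import Enum


class Mode(Enum):
    RUN = "run"        # normal execution / interrupt processing (assumed name)
    WAIT = "wait"
    SNOOZE = "snooze"
    SLEEP = "sleep"


def next_mode(mode, wait_bit=False, lines_active=False,
              interrupt_taken=False, clock_running=True):
    """Return the next power mode, following the transitions in the text.

    wait_bit        -- Wait Mode bit set in the CPS register
    lines_active    -- any interrupt or trap input line is held active
    interrupt_taken -- an unmasked interrupt is recognised
    clock_running   -- the input clock is toggling (not held high or low)
    """
    if mode is Mode.RUN:
        return Mode.WAIT if wait_bit else Mode.RUN
    if mode is Mode.WAIT:
        if interrupt_taken:
            return Mode.RUN
        # An active but masked line (e.g. INTR3) keeps the processor in Wait Mode.
        return Mode.WAIT if lines_active else Mode.SNOOZE
    if mode is Mode.SNOOZE:
        if not clock_running:
            return Mode.SLEEP
        return Mode.RUN if interrupt_taken else Mode.SNOOZE
    # Sleep Mode: restarting the clock returns the processor to Snooze Mode.
    return Mode.SNOOZE if clock_running else Mode.SLEEP
```

For example, next_mode(Mode.WAIT, lines_active=True) stays in Wait Mode, which corresponds to holding INTR3 active with its interrupt masked.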

1.6.1 Am29040 Evaluation

Like any 29K processor, the Am29040 can be evaluated using the Architectural
Simulator. But for those who wish for real hardware, AMD manufactures a number
of evaluation boards, the most popular being the SE29040 evaluation board. The
board, originally constructed in rev-A form, supports 4M bytes of DRAM (expandable
to 64M bytes); DRAM timing is 3/1, i.e. 3-cycle first access then 1-cycle burst.
There is also 1M byte of 32-bit wide ROM and space for 1M byte of 2/1 SRAM.
Boards are typically populated with only 128K of SRAM. The memory system clock
speed is 25 MHz and the maximum processor speed of 50 MHz is supported.
There are connections for JTAG and a logic analyzer as well as two UARTs via
an 85C30 serial communications controller. The board requires a 5-volt power supply
and there is a small wire-wrap area for placement of additional system components.

The later rev-B boards have an additional parallel port and an Ethernet connection
(10BASE-T). An AMD HiLANCE is used for Ethernet communication. The rev-B
board can also support memory system speeds up to 33 MHz.

1.7 A SUPERSCALAR 29K PROCESSOR

AMD representatives have talked at conferences and to the engineering press
about a superscalar 29K processor. No announcements have yet been made about
when such a processor will be available, but it is generally expected to be in the near
future. At the 1994 Microprocessor Forum, AMD presented a product overview, but
many of the specific details of the processor architecture were not announced.
However, piecing together available information, it is possible to form ideas about
what a superscalar 29K would look like.

This section does not describe a specific processor, but presents the superscalar
techniques which are likely to be utilized. A lead architect of the 29K family, Mike
Johnson, has written a textbook, “Superscalar Microprocessor Design” ([Johnson 1991]),
which covers the technology in depth. It might be expected that many of
the conclusions drawn in Johnson’s book will appear in silicon in a future 29K
processor.

AMD has stated that future microprocessors will be pin compatible with the current
2-bus family members. This indicates that a superscalar 29K will be pin compatible
with the Am29030 and Am29040 processors. It is much more likely that the processor
will take 2-bus form rather than that of a microcontroller. User mode instruction
compatibility can also be expected. Given the usual performance increments that
accompany a new processor’s introduction, it will likely sustain twice the performance
of an Am29040 processor. This may be an underestimate, as higher clock rates
or increased use of Scalable Clocking may allow for even higher performance. The
processor is certain to have considerable on-chip instruction and data cache. AMD’s
product overview indicates that 2x, 3x and 4x Scalable Clocking will be supported
and there will be an 8K byte instruction cache and an 8K byte data cache. Also
reported was an internal clock speed of up to 100 MHz at 3.3 volts.

A superscalar processor achieves higher performance than a conventional scalar
processor by executing more than one instruction per cycle. To achieve this it must
have multiple function units which can operate in parallel. AMD has indicated that
the initial superscalar 29K processor will have six function units. Since about
50% of instructions perform integer operations, there will be two integer operation
units, one integer multiplier and one funnel shifter. If a future processor supports
floating-point operations directly, we can expect to see a floating-point execution
unit added. Other execution units are included to deal with off-chip access via load
and store instructions, and to deal with branch instruction execution. All the function
units, except the integer multiplier, produce their results in a single cycle.

High speed operation can only be obtained if as many as possible of the function
units can be kept productively busy during the same processor cycles. This will place
a heavy demand on instruction decoding and operand forwarding. Several instructions
will have to be decoded in the same cycle and forwarded to the appropriate
execution unit. The demand for operands for these instructions will be considerably
higher than must be dealt with by a scalar processor. The following sections describe
some of the difficulties encountered when attempting to execute more than one
instruction per cycle. Architectural techniques which overcome the inherent
difficulties are presented.

1.7.1  Instruction  Issue  and  Data  Dependency 

The term instruction issue refers to the passing of an instruction from the processor
decode stage to an execution unit. With a scalar processor, instructions are issued
in-order; by that I mean in the order the decoder received the instructions from
cache or off-chip memory. Instructions naturally complete in-order. However, with a
RISC processor, out-of-order completion is not unusual for certain instructions.
Typically, load and store instructions are allowed to execute in parallel with other
instructions. These instructions are issued in-order; they don’t complete immediately
but some time (a few cycles) later. The instructions following loads or stores are issued
and execute in parallel unless there are data dependencies. Dependencies arise
when, for example, a load instruction is followed by an operation on the loaded data.

A superscalar processor can reduce total execution time for a code sequence if it
allows all instruction types to complete out-of-order. Instruction issue need not stop
after an instruction is issued to a function unit which takes multiple cycles to complete.
Consequently, function units with long latency may complete their operation
after a subsequent instruction issued to a low latency function unit. The Am29050
processor allows long latency floating-point operations to execute in parallel with
other integer operations. The processor has an additional port on its register file for
writing back the results of floating-point operations. An additional port is required
to avoid the contention which would arise with an integer operation writing back its
result at the same time. Most instructions are issued to an integer unit which, with a
RISC processor, has only one cycle latency. However, there is very likely to be more
than one integer unit, each operating in parallel.

Write-Read Dependency

Even if a processor is able to support out-of-order instruction completion, it
still must deal with the data dependencies that flow through a program’s execution.
These flow dependencies (often known as true dependencies) represent the movement
of operands between instructions in a program. Examine the code below:

mul   gr96,lr2,lr5    ;write gr96, gr96 = lr2 * lr5
add   gr97,gr96,1     ;read gr96

The first instruction would be issued to the integer multiply unit; this will have
(according to AMD’s product overview) two cycles of latency. The result is
written to register gr96. The second instruction would be issued to a different
integer handling unit. However, it has a source operand supplied in gr96. If the second
instruction had no data dependencies on the first, it would be easy to issue the
instruction while the first was still in execute. However, execution of the first
instruction must complete before the second instruction can start execution. Steps
must be taken to deal with the data dependency. This kind of dependency is also known
as a write-read dependency, because gr96 must be written by an earlier instruction
before a later one can read the result.

Some superscalar processors, such as the Intel i960 CA, use a
reduced-scoreboarding mechanism to resolve data dependencies [Thorton 1970].
When a register is required for a result, a one-bit flag is set to indicate the register is in
use. Currently in-execute instructions set the scoreboard bit for their result registers.
Before an instruction is issued the scoreboard bit is examined. Further instructions
are not issued if the scoreboard indicates that an in-execute instruction intends to
write a register which supplies a source operand for the instruction waiting for issue.
When an instruction completes, the relevant scoreboard bit is cleared. This may
result in a currently stalled instruction being issued.

It is unlikely a 29K processor will use scoreboarding; and even less likely it will
use a reduced-scoreboarding mechanism, such as that of the i960 CA, which only detects
data dependency for out-of-order instruction completion. A superscalar 29K
processor will support out-of-order instruction issue, which is described shortly.
Scoreboarding can resolve the resulting data dependencies. However, other
techniques, such as register renaming, enable instructions to be decoded and issued
further ahead than is possible with scoreboarding. This will be described in more
detail as we proceed.

Write-Write Dependency

A  second  type  of  data  dependency  can  complicate  out-of-order  instruction 
completion.  Examine  the  code  sequence  shown  below: 

mul  gr96,lr2,lr5  ;write  gr96,  gr96  =  lr2  *  lr5 

add  gr97, gr96, 1 

add  gr96, lr5, 1  ;write  gr96,  write-write  dependency 

The result of the third instruction has an output dependency on the first
instruction. The third instruction can not complete before the first. Both instructions
write their results to register gr96, and completing the first instruction last would
result in an out-of-date value being held in gr96. Steps must be taken to deal with the
data dependency. Because the completion of multiple instructions is dependent on
writing gr96 with the correct value, this kind of dependency is also known as a
write-write dependency.

Scoreboarding  or  reduced-scoreboarding  can  also  resolve  write-write 
dependences.  Before  an  instruction  is  issued,  the  scoreboard  bit  for  the  result  register 
is  tested.  If  there  is  a  currently  in-execute  instruction  planning  on  writing  to  the  same 
result  register,  the  scoreboard  bit  will  be  set.  This  information  can  be  used  to  stall 
issuing  until  the  result  register  is  available. 
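
The scoreboard checks described for write-read and write-write dependencies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the mechanism of any particular processor: the class name Scoreboard and its methods are hypothetical, and one "busy" flag per register is modelled as membership in a set.

```python
class Scoreboard:
    """One busy bit per register, set while an in-execute instruction will write it."""

    def __init__(self):
        self.busy = set()  # registers with their scoreboard bit set

    def can_issue(self, dest, srcs):
        # Write-read check: no source may be awaiting a pending write.
        # Write-write check: the result register must not be awaiting a write.
        return dest not in self.busy and all(s not in self.busy for s in srcs)

    def issue(self, dest, srcs):
        """Issue if possible; return False when the instruction must stall."""
        if not self.can_issue(dest, srcs):
            return False
        self.busy.add(dest)
        return True

    def complete(self, dest):
        # Clearing the bit may allow a stalled instruction to issue.
        self.busy.discard(dest)


sb = Scoreboard()
sb.issue("gr96", ("lr2", "lr5"))         # mul gr96,lr2,lr5 is issued
stalled = not sb.issue("gr97", ("gr96",))  # add gr97,gr96,1 stalls on gr96
```

Because issue is in-order in this sketch, the stalled add also holds up every instruction behind it, which is the limitation the following paragraphs address.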

The parallel execution possible with out-of-order completion enables higher
performance than in-order completion, but extra logic is required to deal with data
dependency checking. With in-order instruction issue, instructions can no longer be
issued once a dependency is detected. If instruction issue is to continue when data
dependencies are present, the processor architecture becomes yet more complicated;
but the performance reward is extended beyond that of out-of-order completion
with in-order issue.

Read-Write  Dependency 

Instruction  issuing  can  continue  even  when  the  write-read  and  write-write 
dependencies  described  above  are  present.  The  preceding  discussion  on  data 
dependency  was  restricted  to  in-order  instruction  issue.  Certainly,  when  a  data 
dependency  is  detected,  the  unfortunate  instruction  can  not  be  issued;  but  this  need 
not  mean  that  future  instructions  can  not  be  issued.  Of  course  the  future  instruction 
must  be  free  of  any  dependencies.  With  out-of-order  instruction  issue,  instructions 
are  decoded  and  placed  in  an  instruction  window.  Instructions  can  be  issued  from  the 
window  when  they  are  free  of  dependencies  and  there  is  an  available  function  unit. 

The processes of decoding and executing an instruction are separated by the
instruction window, see Figure 1-5. This does not add an additional pipeline stage to
the superscalar processor. The decoder places instructions into the window. When an
instruction is free of dependencies it can be issued from the window to a function unit
for execution. The instruction window could be implemented as a large buffer within the
instruction decode unit, but this leads to a complex architecture. When an instruction
is issued, the op-code and operands must be communicated to the function unit.
When multiple instructions are issued in a single cycle, a heavy demand is placed on
system busses and register file access ports. An alternative window implementation
is to hold instructions at the function units in reservation stations. This way
instructions are sent during decode to the appropriate function unit along with any
available operands. They are issued from the reservation station (really the window)
when any remaining dependencies are resolved and the function unit is available for
execution. The operation of reservation stations is described in more detail in section
1.7.2.



Figure  1-5.  The  Instruction  Window  for  Out-of-Order  Instruction  Issue 


An  instruction  is  issued  from  the  window  when  its  operands  are  available  for 
execution.  Future  instructions  may  be  issued  ahead  of  earlier  instructions  which 
become  blocked  due  to  data  dependencies.  Executing  instructions  out-of-order 
introduces  a  new  form  of  data  dependency  not  encountered  with  in-order  instruction 
issue.  Examine  the  code  sequence  below: 


mul   gr96,lr2,lr5    ;gr96 = lr2 * lr5
add   gr97,gr96,1     ;read gr96
add   gr96,lr5,1      ;write gr96, read-write dependency

The third instruction in the sequence uses gr96 for its result. The second
instruction receives an operand in the same gr96 register. The third instruction can
not complete and write its result until the second instruction begins execution;
otherwise the second will receive the wrong operand. The result of the third
instruction has an antidependency on the operand to the second instruction. The
dependency is very much like an in-order issue dependency but reversed. This kind
of dependency is also known as a read-write dependency, because gr96 must be read by
the second instruction before the third can write its result to gr96.
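
The three kinds of dependency discussed so far can be detected mechanically by comparing the registers two instructions read and write. The Python sketch below is illustrative only; the Instr tuple and the classify function are hypothetical names, and immediates are left out of the source lists because they cannot create dependencies.

```python
from collections import namedtuple

# dest is the result register; srcs lists the source registers only.
Instr = namedtuple("Instr", "op dest srcs")


def classify(earlier, later):
    """Return the dependencies that `later` has on `earlier` (program order)."""
    kinds = []
    if earlier.dest is not None and earlier.dest in later.srcs:
        kinds.append("write-read")    # true (flow) dependency
    if earlier.dest is not None and earlier.dest == later.dest:
        kinds.append("write-write")   # output dependency
    if later.dest is not None and later.dest in earlier.srcs:
        kinds.append("read-write")    # antidependency
    return kinds


program = [
    Instr("mul", "gr96", ("lr2", "lr5")),  # mul gr96,lr2,lr5
    Instr("add", "gr97", ("gr96",)),       # add gr97,gr96,1
    Instr("add", "gr96", ("lr5",)),        # add gr96,lr5,1
]

for i in range(len(program)):
    for j in range(i + 1, len(program)):
        print(i + 1, j + 1, classify(program[i], program[j]))
```

Applied to the sequence above, the first and second instructions show a write-read dependency, the first and third a write-write dependency, and the second and third a read-write dependency.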

Registers are used to hold data values. The flow of data through a program is
represented by the registers accessed by instructions. When instructions execute
out-of-order, the flow of data between instructions is restricted by the reuse of
registers to hold different data values. In the above example we want to issue the third
instruction but its use of gr96 creates a problem. The second instruction is receiving,
via gr96, a data value produced by the first instruction. The register label gr96 is
merely used as an identifier for the data flow. What is intended is that data be passed
from the first instruction to the second. If our intentions could be communicated
without restricting data passing to gr96, then the third instruction could be executed
before the second.

The  problem  can  be  overcome  by  using  register  renaming,  see  section  1.7.3. 
Briefly,  when  the  first  instruction  in  the  above  example  is  issued,  it  writes  its  result  to 
a  temporary  register  identified  by  the  name  gr96.  The  second  instruction  receives  its 
operand  from  the  same  temporary  register  used  by  the  first  instruction.  Execution  of 
the  third  instruction  need  not  be  stalled  if  it  writes  its  result  to  a  different  copy  of 
register  gr96.  So  now  there  are  multiple  copies  of  gr96.  What  really  happens  is 
temporary  registers  are  renamed  to  be  gr96  for  the  duration  of  the  data  flow.  These 
temporary  registers  play  the  role  of  registers  indicated  by  the  instruction  sequence. 
They  are  tagged  to  indicate  the  register  they  are  duplicating. 

1.7.2  Reservation  Stations 

Each function unit has a number of reservation stations which hold instructions
and operands waiting for execution, see Figure 1-6. All the reservation stations for
each function unit combined represent the instruction window from which
instructions are issued. The decoder places instructions into reservation stations
[Tomasulo 1967] with copies of operands, when available. Otherwise operand values
are replaced with tags indicating the register supplying the missing data. Placing a
copy of a source operand into the reservation station when an instruction is decoded
prevents the operand being updated by a future instruction, and hence eliminates
antidependency conflicts. A function unit issues instructions to its execute stage when
it is not busy and a reservation station has an instruction ready for execution. Once an
instruction is placed in a reservation station, its issue occurs regardless of any
instruction issue occurring in another function unit. There can be any number of
reservation stations attached to a function unit. The greater the number, the larger the
instruction window, and the further ahead the processor can decode and issue
instructions. Additionally, a greater number of reservation stations prevents short
term demands on a function unit resulting in decoder stalling.

An instruction may be stalled in a reservation station when a data dependency
causes a tag, rather than data, to be placed in the operand field. The necessary data
will become available when some other instruction completes and its result is made
available. The instruction producing the required data value may be in a reservation
station or in execution in the same function unit, or in another function unit. Result
values are tagged indicating the register they should be placed in. With a scalar
processor, the result is always written to the instruction’s destination register. But
when register renaming is used by a superscalar processor, results are written to a
register which is temporarily playing the role of the destination register. These
temporary registers, known as copy or duplicate registers, are tagged to indicate the
real register they are duplicating.

Figure 1-6. A Function Unit with Reservation Stations

When a function unit completes an instruction, it places the result along with the
tag information identifying the result register on a result bus. If several function units
complete in the same cycle, there can be competition for the limited number of result
busses. Other function units monitor the result bus (or busses). Their intention is to
obtain the missing operands for instructions held in reservation stations. When they
observe a data value tagged with a register number matching a missing operand they
copy the data into the reservation station’s operand field. This may enable the
instruction to be issued.
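
The way a reservation station waits on a tag and then captures the value from a result bus can be illustrated as follows. This Python sketch is a simplification under stated assumptions: ReservationStation and broadcast are hypothetical names, each operand is modelled as a ("value", v) or ("tag", name) pair, and a single result per call stands in for one result bus transfer.

```python
class ReservationStation:
    """Holds one waiting instruction; each operand is a value or a tag."""

    def __init__(self, op, operands):
        self.op = op
        # Each operand is ("value", v) when known, or ("tag", name) when pending.
        self.operands = list(operands)

    def ready(self):
        # The instruction can be issued once no operand is still a tag.
        return all(kind == "value" for kind, _ in self.operands)

    def snoop(self, tag, value):
        # Copy a broadcast result into every operand field waiting on this tag.
        self.operands = [
            ("value", value) if (kind, item) == ("tag", tag) else (kind, item)
            for kind, item in self.operands
        ]


def broadcast(stations, tag, value):
    """Place (tag, value) on the result bus; return stations now ready to issue."""
    for station in stations:
        station.snoop(tag, value)
    return [station for station in stations if station.ready()]


# add gr97,gr96,1 decoded while the multiply producing gr96 is still executing
waiting = ReservationStation("add", [("tag", "gr96"), ("value", 1)])
now_ready = broadcast([waiting], "gr96", 42)  # 42 is an arbitrary example result
```

With register renaming, the tag would name the temporary (copy) register rather than gr96 itself; the matching logic is otherwise the same.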

Once an instruction is placed into a reservation station it will execute in
sequence with other instructions held in other reservation stations within the same
function unit. Of course exceptional events, or the placing of instructions into the
instruction window which represent over-speculation, can divert the planned
execution. The instruction window supports speculative instruction decoding. It is
possible that a branch instruction can result in unsuccessful speculation, and the
window must then be refilled with instructions fetched from a new instruction sequence.
If a superscalar processor’s performance is to be kept high, it is important that
speculation be successful. For this to be accomplished, branch prediction techniques
must be employed; more on this is in section 1.7.4.

1.7.3  Register  Renaming 

It was briefly described in the previous section dealing with read-write
dependency (antidependency) that register renaming can help deal with the conflicts
which arise from the reuse of the same register to hold data values. Of course these
dependencies only arise from the out-of-order instruction issue which occurs with a
superscalar processor. Also described were write-write (output) dependencies,
which occur even with in-order instruction issue when more than one instruction
wishes to write the same result register. Both these types of dependency can be
grouped under the heading storage conflicts. Their interference with concurrent
instruction execution is only temporary. Duplication of the result register for the
duration of the conflict can resolve the dependency and enable superscalar
instruction execution to continue.

The temporary result registers are allocated from a reorder buffer which
consists of 10 registers and supporting tag information. Every new result value is
allocated a new copy of the original assignment register. Copies are tagged to enable
them to be used as source operands in future instructions. Register renaming is shown
for the example code sequence below.


;original code
mul   gr96,lr2,lr5
add   gr97,gr96,1
add   gr96,lr5,1

;code after register renaming
mul   RR1,lr2,lr5     ;gr96 = lr2 * lr5
add   RR2,RR1,1
add   RR3,lr5,1


The write-write dependency between the first and third instruction is resolved
by renaming register gr96 to be register RR3 in the third instruction. Renaming
gr96 to be RR3 in the third instruction also resolves the read-write dependency
between the second and third instruction. Using register renaming, execution of the
third instruction need not be stalled due to storage (register) dependency. Figure 1-7
shows the dependencies before and after register renaming.
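
The renaming step itself is simple to express: every result is given a fresh temporary register, and each source operand is replaced by the most recent temporary standing in for that register. The Python sketch below reproduces the renamed sequence shown above; the function name rename and the tuple layout (op, dest, srcs) are hypothetical, and the RRn names are simply numbered in decode order.

```python
def rename(program):
    """Rename each result register to a fresh temporary RR1, RR2, ...

    program is a list of (op, dest, srcs) tuples in decode order; srcs may
    mix register names and integer immediates.
    """
    latest = {}   # architectural register -> most recent temporary copy
    renamed = []
    for number, (op, dest, srcs) in enumerate(program, start=1):
        # Sources read the newest copy, or the real register if none exists.
        new_srcs = tuple(latest.get(src, src) for src in srcs)
        temp = f"RR{number}"
        latest[dest] = temp
        renamed.append((op, temp, new_srcs))
    return renamed


original = [
    ("mul", "gr96", ("lr2", "lr5")),
    ("add", "gr97", ("gr96", 1)),
    ("add", "gr96", ("lr5", 1)),
]

for op, dest, srcs in rename(original):
    print(op, dest, *srcs)
```

After renaming, only the true write-read dependency between the first and second instructions remains (RR1), so the third instruction is free to execute ahead of both.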

Let’s look in more detail at the operation of the reorder buffer. When an instruction
is decoded and placed in the instruction window (in practice, a reservation station),
a register in the reorder buffer is assigned to hold the instruction result.
Figure 1-8 shows the format of information held in the reorder buffer. When the
instruction is issued from the reservation station and, at a later time, execution is
completed, the result is written to the assigned reorder buffer entry.

If a future instruction refers to the result of a previous instruction, the reorder
buffer is accessed to obtain the necessary value. The reorder buffer is accessed via the
contents of the destination-tag field. This is known as a content-addressable
memory access. A parallel search of the reorder buffer is performed. All memory
locations are simultaneously examined to determine if they have the requested data.

Figure  1-7.  Register  Dependency  Resolved  by  Register  Renaming 


If the instruction producing the result has not yet completed execution, then the
dispatched instruction is provided with a reorder-buffer-tag for the pending data.
For example, the second instruction in the above code sequence would receive
reorder-buffer-tag RR1.

It is likely that the reorder buffer contains entries which are destined (tagged) for
the same result register. When the reorder buffer is accessed with a destination-tag
which has multiple entries, the reorder buffer provides the most recent entry. This
ensures the most recently assigned (according to instruction decode) value is used. In
such a case, the older entry could be discarded; but it is kept in case of an exceptional
event, such as an interrupt or trap, occurring.

When an instruction completes, the reorder buffer entry is updated with the result
value. A number of result busses are used to forward result values, and their
associated tag information, to the reorder buffer. Function units monitor the flow of
data along these busses in the hope of acquiring data values required by their
reservation stations. In this way, instructions are supplied the operands which were
missing when the instruction was decoded. When a reorder buffer entry has been updated
with a result, the entry is ready for retiring. This is the term given to writing the result
value into the real register in the register file. There is a bus for this task which connects
read ports on the reorder buffer to write ports on the register file. The number of ports
assigned to this task (2) limits the number of instructions which can be retired in any
one processor cycle. A register file with two write ports supports a maximum of four
instructions being retired during the same cycle: two instructions which modify result
registers, one store instruction, and one branch instruction (these last two instruction
types do not write to result registers). Figure 1-9 outlines the system layout.

Figure 1-8. Circular Reorder Buffer Format

When  the  reorder  buffer  becomes  full,  no  further  instruction  decoding  can  occur 
until  entries  are  made  available  via  instruction  retiring.  Instructions  are  retired  in  the 
order  they  are  placed  in  the  reorder  buffer.  This  ensures  in-order  retiring  of 
instructions.  Should  an  exceptional  event  occur  during  program  execution,  the  state 
of  instruction  retirement  specifies  the  precise  position  which  execution  has  reached 
within  the  program.  Only  completed  instructions,  without  exceptions,  are  retired. 
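
The reorder buffer behaviour described in this section (allocation at decode, lookup of the most recent entry by destination tag, and in-order retiring through a limited number of write ports) can be modelled in a short Python sketch. It is a simplified illustration, not the actual hardware organisation: the class and method names are hypothetical, only register-writing instructions are modelled, and a deque stands in for the circular buffer.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size=10, retire_ports=2):
        self.size = size                  # 10 entries, as stated in the text
        self.retire_ports = retire_ports  # register file write ports (2)
        self.entries = deque()            # oldest entry on the left
        self.next_tag = 1

    def allocate(self, dest):
        """Assign an entry at decode; None means the decoder must stall."""
        if len(self.entries) >= self.size:
            return None
        tag = f"RR{self.next_tag}"
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})
        return tag

    def lookup(self, reg):
        """Search by destination tag, newest entry first."""
        for entry in reversed(self.entries):
            if entry["dest"] == reg:
                if entry["done"]:
                    return ("value", entry["value"])
                return ("tag", entry["tag"])
        return None  # no copy exists: read the real register file instead

    def complete(self, tag, value):
        """Record a result delivered on a result bus."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True

    def retire(self, regfile):
        """Retire completed entries in order, limited by the write ports."""
        retired = 0
        while (self.entries and self.entries[0]["done"]
               and retired < self.retire_ports):
            entry = self.entries.popleft()
            regfile[entry["dest"]] = entry["value"]
            retired += 1
        return retired
```

In this model a completed entry cannot retire while an older entry is still pending, which is what preserves the precise program position when an exceptional event occurs.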


Figure 1-9. Multiple Function Units with a Reorder Buffer

Figure 1-9 shows the operand busses supplying source operands from the
reorder buffer to the reservation stations. However, in some cases, when an
instruction is decoded and the operand register’s number presented to the reorder
buffer, no entry is found. This indicates there is currently no copy of the required
register. Consequently, the real register in the register file must be accessed to obtain
the data. For this reason the register file is provided with read ports (4) which supply
data to the operand bus.

1.7.4  Branch  Prediction 

Out-of-order instruction issue places a heavy demand on instruction decoding.
If reservation stations are to be kept filled, instruction decode must proceed at a rate
equal to, or greater than, instruction execution. Otherwise, performance will be
limited by the ability to decode instructions. The major obstacle in the way of
achieving efficient decoder operation is branch instructions. Unfortunately,
instruction sequences typically contain only about five or six instructions before a
further branch-type instruction is encountered. Compilers directed to producing
code specifically for superscalar processor execution try to increase this critical
parameter. Additionally, the fact that the target of a branch instruction need not be
aligned on a cache block boundary can further reduce the efficiency of the decoding
process.

The  decoder  fetches  instructions  and  places  them  into  the  instruction  window 
for  issue  by  a  function  unit.  If  an  average  decode  rate  of  more  than  two  instructions 
per  cycle  is  to  be  achieved,  it  is  likely  that  a  four-instruction  decoder  (or  better)  will 
be  required.  In  fact,  AMD’s  product  overview  indicates  a  four-instruction  decoder  is 
used.  To  study  this  further,  first  examine  the  code  below.  The  first  target  sequence 
begins  at  address  label  L13.  The  linker  need  not  align  the  L13  label  at  a  cache  block 
boundary  —  a  cache  block  size  of  four  instructions  will  be  assumed.  The  same 
alignment issue occurs with the second target sequence beginning at label L14. The
decoder  is  presented  with  a  complete  cache  block  rather  than  sequential  instructions 
from  within  the  block.  This  requires  a  128-bit  bus  between  the  instruction  cache  and 
the  decode  unit.  However,  this  is  essential  if  instructions  are  to  be  decoded  in  parallel. 
Figure  1-10  shows  a  possible  cache  block  assignment,  assuming  the  target  of  the  first 
instruction  sequence  begins  in  the  second  entry  of  the  cache  block.  The  target  of  the 
second  sequence  begins  in  the  third  instruction  of  the  block. 


L13:                               ; target of a branch
        add     gr98, gr98, 10     ; gr98 = gr98 + 10
        sll     gr99, gr99, 2
        cpgt    gr97, gr97, gr98
        jmpt    gr97, L14          ; conditional branch to L14
        add     lr4, lr4, gr99     ; branch delay slot, see section 1.13
        load    0, 0, gr97, lr4
        store   0, 0, gr97, gr96

L14:                               ; target of branch
        jmp     L16                ; unconditional branch to L16
        const   lr10, 0            ; branch delay slot, always executed


The branch instruction from the first code sequence to label L14 is located in the
second instruction of the block. Assuming two cycles are required to fetch the target
block, the decoder is left with nothing to decode for several cycles. Additionally,
branch alignment has resulted in there being fewer than four instructions available for
decode during any cycle. The resulting decode rate is 1 instruction per cycle. This
would result in little better than scalar processor performance, much less than the
desired 2 or more instructions per cycle.


[Figure: timeline showing the cache block being decoded in each cycle]

Figure 1-10. Instruction Decode with No Branch Prediction


In  Figure  1-10  the  target  sequence  is  found  in  the  cache.  Of  course  the  cost  of  the 
branch  would  be  much  higher  if  the  target  instructions  had  to  be  fetched  from 
off-chip  memory.  Additionally,  a  two-cycle  branch  delay  is  shown.  This  is  typically 
defined  as  the  time  from  decoding  the  branch  instruction  till  decoding  the  target 
instruction.  The  actual  delay  encountered  is  difficult  to  estimate,  as  the  target  address 
is  not  known  until  the  jump  instruction  is  executed.  Figure  1-10  shows  the  cycle 
when  the  jump  instruction  is  placed  in  the  instruction  window.  When  it  will  be  issued 
depends  on  a  number  of  factors  such  as  register  dependency  and  reservation  station 
activity.  Additionally  the  result  of  the  jump  must  be  forwarded  to  the  decode  unit 
before  further  instruction  decode  can  proceed.  In  practice,  several  cycles  could 
elapse  before  the  decoder  obtains  the  address  of  the  cache  block  containing  the  target 
instruction. 

It is clear from the above discussion that a superscalar processor must take steps
to achieve a higher instruction decode rate. This is likely to involve some form of
branch prediction. The decoder cannot wait for the outcome of the branch instruction
to be known before it starts fetching the new instruction stream. It must examine the
instruction currently being decoded, and determine if a branch is present. When a
branching instruction is found, the decoder must predict both if the branch will be
taken and the target of the branch. This enables instructions to be fetched and
decoded along the predicted path. Of course, unconditional branches also benefit
from early fetching of their target instruction sequence, and they do not require
branch prediction support.

The  instruction  decode  sequence  for  the  previous  code  example  is  shown  in 
Figure  1-11  using  branch  prediction.  Without  waiting  for  the  conditional-jump 
instruction  in  the  second  entry  of  the  cache  block  to  execute,  the  decoder  predicts  the 
branch  will  be  taken  and  in  the  next  cycle  starts  decoding  the  block  containing  the 
target  instruction.  This  results  in  a  decode  rate  of  2.33  instructions  per  cycle.  If  the 
prediction  is  correct,  the  decoder  should  be  able  to  sustain  a  decode  rate  which 
prevents  starving  the  function  units  of  instructions. 
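
The two decode rates quoted above are simply instructions decoded divided by cycles elapsed. The Python sketch below reproduces that arithmetic. The per-cycle counts are assumptions chosen to agree with the totals given in the text (seven instructions in three cycles with prediction, and a two-cycle branch delay after each jump without it); they are not read from the figures.

```python
from fractions import Fraction


def decode_rate(decoded_per_cycle):
    """Average number of instructions decoded per cycle."""
    return Fraction(sum(decoded_per_cycle), len(decoded_per_cycle))


# Assumed timelines: each entry is the number of useful instructions
# decoded in that cycle; 0 marks a cycle lost waiting for a branch target.
with_prediction = [3, 2, 2]
without_prediction = [3, 2, 0, 0, 2, 0, 0]

print(float(decode_rate(with_prediction)))     # about 2.33
print(float(decode_rate(without_prediction)))  # 1.0
```

Changing the assumed branch delay or the alignment of the targets within their cache blocks changes the lists, and so the resulting rate.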


[Figure: timeline, in cycles, of the cache block being decoded. Instructions decoded:
add gr98,gr98,10; sll gr99,gr99,2; cpgt gr97,gr97,gr98; jmpt gr97,L14;
add lr4,lr4,gr99; jmp L16; const lr10,0.
Average decode rate = 7/3 = 2.33 instructions/cycle]

Figure 1-11. Four-Instruction Decoder with Branch Prediction


Branch prediction supports speculative instruction fetching. It results in
instructions being placed in the instruction window which may be speculatively
dispatched and executed. If the branch is wrongly predicted, instructions still waiting
in reservation stations must be cancelled. Any wrongly predicted instructions which
reach execution must not be retired. This requires considerable support circuitry. For
this reason scoreboarding is used by some processors to support speculative
instruction fetching. With scoreboarding the decoder sets a scoreboard bit for each
instruction’s destination register. Since there is only one bit indicating there is a
pending update, there can be only one such update per register. Consequently, the
decoder stalls when encountering an instruction required to update a register which
already has a pending update. The scoreboarding mechanism is simpler to implement
than register renaming using a reorder buffer. However, its restrictions limit the
decoder’s ability to speculatively fetch instructions further ahead of actual execution.
This has been shown to result in about 21% poorer performance when a
four-instruction decoder is used [Johnson 1991].
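
The scoreboard rule described above (one pending-update bit per register, with a decoder stall on a second update) can be illustrated with a small model. This is a minimal Python sketch; the class and method names are hypothetical and do not describe any particular processor's circuitry.

```python
class Scoreboard:
    """One pending-update bit per register (hypothetical model)."""

    def __init__(self):
        self.pending = set()   # registers with an outstanding write

    def try_decode(self, dest):
        """Return True if the instruction can be decoded, False on a stall."""
        if dest in self.pending:
            return False       # second update to the same register: stall
        self.pending.add(dest)
        return True

    def retire(self, dest):
        """Clear the bit when the result is written to the register file."""
        self.pending.discard(dest)


sb = Scoreboard()
results = [sb.try_decode("gr98"), sb.try_decode("gr99"), sb.try_decode("gr98")]
sb.retire("gr98")                       # the first gr98 update completes
results.append(sb.try_decode("gr98"))   # the stalled instruction can now proceed
print(results)                          # [True, True, False, True]
```

With register renaming the third instruction would instead be given a fresh reorder buffer entry and decoding would continue, which is the extra freedom the text refers to.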

It is certain that a superscalar 29K processor will incorporate a branch
prediction technique. Given that instruction compatibility is to be maintained, it is
likely that a hardware prediction rather than a software prediction method will be
employed. This will require the processor to keep track of previous branch activity.
An algorithm will likely help with selecting the most frequent branch paths; for
example, branches to lower addresses are more often taken than not (the jump at the
bottom of a loop).
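
As an illustration of the two ideas in this paragraph, the Python sketch below combines the backward-branch heuristic with a per-branch history. The two-bit saturating counter is a common textbook scheme used here as an assumption; the text does not say which mechanism a superscalar 29K processor would use, and all names are hypothetical.

```python
def static_predict(branch_addr, target_addr):
    """Predict taken when the branch jumps to a lower address (loop bottom)."""
    return target_addr < branch_addr


class TwoBitPredictor:
    """History-based predictor: one 2-bit saturating counter per branch."""

    def __init__(self):
        self.counters = {}   # branch address -> counter in 0..3

    def predict(self, branch_addr, target_addr):
        if branch_addr not in self.counters:
            # No history yet: fall back on the static heuristic.
            return static_predict(branch_addr, target_addr)
        return self.counters[branch_addr] >= 2

    def update(self, branch_addr, taken):
        c = self.counters.get(branch_addr, 2 if taken else 1)
        c = min(c + 1, 3) if taken else max(c - 1, 0)
        self.counters[branch_addr] = c


p = TwoBitPredictor()
outcomes = [True, True, True, False]    # loop branch: taken 3 times, then exits
hits = 0
for taken in outcomes:
    hits += p.predict(0x1010, 0x1000) == taken
    p.update(0x1010, taken)
print(hits)                             # 3 of the 4 predictions are correct
```

Only the final loop exit is mispredicted, which is the behaviour that makes "backward branches are taken" a useful default.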


1.8  THE  Am29200  MICROCONTROLLER 

The  Am29200™  was  the  first  of  the  29K  family  microcontrollers  (see 
Table  1-3)  [AMD  1992b].  To  date  the  Am29205™  is  the  only  other  microcontroller 
added  to  the  family.  Being  microcontrollers,  many  of  the  device  pins  are  assigned  I/O 
and  other  dedicated  support  tasks  which  reduce  system  glue  logic  requirements.  For 
this reason none of the devices are pin compatible. The system support facilities,
included within the Am29200 package, make it ideal for many highly integrated and
low  cost  systems. 

The  processor  supports  a  32-bit  address  space  which  is  divided  into  a  number  of 
dedicated  regions  (see  Figure  1-12).  This  means  that  ROM,  for  example,  can  only  be 
located in the region preallocated for ROM access. When an address value is
generated, the associated control-logic for the region is activated and used to control data
or  instruction  access  for  the  region. 

There  is  a  32-bit  data  bus  and  a  separate  24-bit  address  bus.  The  rest  of  the  104 
pins  used  by  the  device  are  mainly  for  I/O  and  external  peripheral  control  tasks 
associated  with  each  of  the  separate  address  regions. 

By  incorporating  memory  interface  logic  within  the  chip,  the  processor  enables 
lower system costs and simplified designs. In fact, DRAM devices can be wired
directly to the microcontroller without the need for any additional circuitry.

At the core of the microcontroller is an Am29000 processor. The additional I/O
devices and region control mechanisms supported by the chip are operated by
programmable registers located in the control register region of memory space. For
historical reasons, these control registers are accessible from alternate address
locations. It is best, and essential if C code is used, to access these registers from
the optional word-aligned addresses.

Accessing  memory  or  peripherals  located  in  each  address  region  is  achieved 
with  a  dedicated  region  controller.  While  initializing  the  control  registers  for  each 
region it is possible to specify the access times and, say, the DRAM refresh
requirements for memory devices located in the associated region.

Other peripheral devices incorporated in the microcontroller, such as the UART,
are accessed by specific control registers. The inclusion of popular peripheral
devices and the associated glue logic for peripheral and memory interfaces within a
single RISC chip enables higher performance at lower costs than existing systems

Table 1-3. Am2920x Microcontroller Members of 29K Processor Family

Processor                   Am29200                Am29205
--------------------------  ---------------------  ---------------------
Instruction Cache           -                      -
I-Cache Associativity       -                      -
Data Cache                  -                      -
D-Cache Associativity       -                      -
On-Chip Floating-Point      No                     No
On-Chip MMU                 No                     -
Integer Multiply in h/w     No                     No
Programmable I/O            16 pins                8 pins
ROM width                   8/16/32 bit            16 bit
DRAM width                  16/32 bit              16 bit
On-Chip Interrupt           Yes                    Yes
Controller Inputs           14                     10
Scalable Clocking           No                     No
Burst-mode Addressing       Yes, up to 1K bytes    Yes, up to 1K bytes
Freeze Mode Processing      Yes                    Yes
Delayed Branching           Yes                    Yes
On-Chip Timer               Yes                    Yes
On-Chip Memory Controller   Yes                    Yes
DMA Channels                2                      1
Byte Endian                 Big                    Big
Serial Ports                1                      1
JTAG Debugging              Yes                    No
Clock Speeds (MHz)          16.7, 20               12.5, 16.7

Region Allocation        Address Range

                         0xffff,ffff
reserved
                         0x9600,0000
PIA space
                         0x9000,0000
control regs.
                         0x8000,0000
video-DRAM
                         0x6000,0000
virtual-DRAM
                         0x5000,0000
DRAM
                         0x4000,0000
ROM
                         0x0

Figure 1-12. Am29200 Microcontroller Address Space Regions

(see Figure 1-13). Let’s take a quick look at each of the region controllers and
specialized on-chip peripherals in turn.
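
The fixed address regions of Figure 1-12 amount to a simple lookup on the address value. The Python sketch below treats each boundary address in the figure as the base of the region listed above it, which is an interpretation of the figure; the function name is hypothetical.

```python
# (base address, region name), highest base first; values from Figure 1-12.
REGIONS = [
    (0x96000000, "reserved"),
    (0x90000000, "PIA space"),
    (0x80000000, "control regs."),
    (0x60000000, "video-DRAM"),
    (0x50000000, "virtual-DRAM"),
    (0x40000000, "DRAM"),
    (0x00000000, "ROM"),
]


def region_of(address):
    """Return the name of the region that contains a 32-bit address."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    for base, name in REGIONS:
        if address >= base:
            return name


print(region_of(0x00000000))  # ROM
print(region_of(0x40001000))  # DRAM
print(region_of(0x90FFFFFF))  # PIA space
```

On the chip this selection is made by hardware, which then activates the controller for the chosen region.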

1.8.1  ROM  Region 

The first thing to realize is that ROM space is really intended for all types of
nonmultiplexed-address  devices,  such  as  ROM  and  SRAM.  Controlling  access  to 
these  types  of  memories  is  very  similar.  The  region  is  divided  into  four  banks.  Each 
bank  is  individually  configurable  in  width  and  timing  characteristics.  A  bank  can  be 
associated  with  8-bit,  16-bit  or  32-bit  memory  and  can  contain  as  much  as  16M 
bytes  of  memory  (enabling  a  64M  bytes  ROM  region). 

Bank  0,  the  first  bank,  is  normally  attached  to  ROM  memory  as  code  execution 
after  processor  reset  starts  at  address  0.  During  reset  the  BOOTW  (boot  ROM  width) 
input  pin  is  tested  to  determine  the  width  of  Bank  0  memory.  Initially  the  memory  is 
assumed to have 4-cycle access times (three wait states) and no burst-mode. The
SA29200 evaluation board contains an 8-bit EPROM at bank 0 (SA stands for
standalone). Other banks may contain, say, 32-bit SRAM with different wait state
requirements. It is possible to arrange banks to form a contiguous address range.

Whenever  memory  in  the  ROM  address  range  is  accessed,  the  controller  for  the 
region  is  activated  and  the  required  memory  chip  control  signals  such  as  CE  (chip 
enable),  R/W,  OE  (output  enable)  and  others  are  generated  by  the  microcontroller. 
Thus SRAM and EPROM devices are wired directly to pins on the microcontroller chip.




[Block diagram: an Am29000 core with A, I and D buses connected to the on-chip
parallel port, serial port, video interface, I/O port, ROM controller, PIA, DMA
controller, DRAM controller and interrupt controller. External ROM or SRAM memory
and DRAM memory attach to the chip.]

Figure 1-13. Am29200 Microcontroller Block Diagram


1.8.2  DRAM  Region 

In  a  way  similar  to  the  ROM  region,  there  is  a  dedicated  controller  for  DRAM 
devices  which  are  restricted  to  being  located  in  the  DRAM  address  region.  Once 
again  the  region  is  divided  into  four  banks  which  may  each  contain  as  much  as  16M 
bytes  of  off-chip  memory.  The  DRAM  region  controller  supports  16-bit  or  32-bit 
wide  memory  banks  which  may  be  arranged  to  appear  as  contiguous  in  address 
range. 

DRAM, unlike ROM, is always assumed to have 4-cycle access times. However,
if page-mode DRAM is used it is possible to achieve 2-cycle rather than 4-cycle
burst-mode accesses. Burst-mode is used when consecutive memory addresses are
being accessed, such as during instruction fetching between program branches. The
DRAM memory is often referred to as 3/2 rather than 4/2. The four cycles consist of
1-cycle precharge and 3-cycles latency; under certain circumstances the 1-cycle of
precharge can be hidden. This is explained in section 1.14.1 under the Am29200 and
Am29205 subheading.
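
The 4/2 and 3/2 figures translate directly into a cycle count for a burst of consecutive words. The following Python sketch is a simplified model under stated assumptions: it ignores refresh, page boundary crossings and bus contention, and the function name is hypothetical.

```python
def dram_access_cycles(words, first=4, burst=2):
    """Cycles to access `words` consecutive words from page-mode DRAM."""
    if words < 1:
        raise ValueError("at least one word must be accessed")
    return first + burst * (words - 1)


print(dram_access_cycles(1))            # 4: a single access
print(dram_access_cycles(4))            # 10: 4 + 3 * 2
print(dram_access_cycles(4, first=3))   # 9: precharge cycle hidden
```

The longer the burst, the closer the average cost per word comes to the 2-cycle burst rate.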

The control register associated with each DRAM bank maintains a field for
DRAM refresh support. This field indicates the number of processor cycles between
DRAM refresh. If refresh is not disabled, “CAS before RAS” cycles are performed
when required. Refresh is overlapped in the background with non-DRAM access
when possible.

If  a  DRAM  bank  contains  video-DRAM  rather  than  conventional  DRAM,  then 
it  is  possible  to  perform  data  transfer  to  the  VDRAM  shift  register  via  accesses  in  the 
VDRAM  address  range.  The  VDRAM  is  aliased  over  the  DRAM  region.  Accessing 
the  memory  as  VDRAM  only  changes  the  timing  of  memory  control  signals  such  as 
to  indicate  a  video  shift  register  transfer  is  to  take  place  rather  than  a  CPU  memory 
access. 

1.8.3  Virtual  DRAM  Region 

A 16-Mbyte (24 address bit) virtual address space is supported via four
mapping registers. The virtually addressed memory is divided into 64K byte (16 address
bits)  memory  pages  which  are  mapped  into  physical  DRAM.  Each  mapping  register 
has  two  8-bit  fields  specifying  the  upper  address  bits  of  the  mapped  memory  pages. 
When  memory  is  accessed  in  the  virtual  address  space  range,  and  one  of  the  four 
mapping  registers  contains  a  match  for  the  virtually  addressed  page  being  accessed, 
then  the  access  is  redirected  to  the  physical  DRAM  page  indicated  by  the  mapping 
register. 

When  no  mapping  register  contains  a  currently  valid  address  translation  for  the 
required  virtual  address,  a  processor  trap  occurs.  In  this  case  memory  management 
support  software  normally  updates  one  of  the  mapping  registers  with  a  valid  mapping 
and  normal  program  execution  is  restarted. 
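
The translation just described can be sketched in a few lines of Python. The sketch assumes that the two 8-bit fields of a mapping register hold a virtual page number and a physical page number, and it models an invalid register as `None`; the register layout, the exception and the function name are all hypothetical.

```python
PAGE_BITS = 16                       # 64K byte pages
PAGE_MASK = (1 << PAGE_BITS) - 1


class MappingTrap(Exception):
    """Raised when no mapping register translates the address."""


def translate(vaddr, mapping_regs):
    """Map a 24-bit virtual DRAM offset to a physical DRAM offset.

    mapping_regs holds up to four (virtual_page, physical_page) pairs,
    each page number being an 8-bit value; None marks an invalid entry.
    """
    vpage = (vaddr >> PAGE_BITS) & 0xFF
    for entry in mapping_regs:
        if entry is not None and entry[0] == vpage:
            return (entry[1] << PAGE_BITS) | (vaddr & PAGE_MASK)
    raise MappingTrap(hex(vaddr))


regs = [(0x12, 0x03), None, (0x40, 0x00), None]
print(hex(translate(0x121234, regs)))   # 0x31234
```

A trap handler in the memory management software would catch the equivalent of `MappingTrap`, install a valid pair in one of the four registers, and restart the access.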

Only  DRAM  can  be  mapped  into  the  virtual  address  space.  The  address  region 
supports  functions  such  as  image  compression  and  decompression  that  yield  lower 
overall  memory  requirements  and,  thus,  lower  system  costs.  Images  can  be  stored  in 
virtually addressed space in a compressed form, and only uncompressed into
physically accessed memory when required for image manipulation or output video
imaging.

1.8.4  PIA  Region 

The  Peripheral  Interface  Adapter  (PIA)  region  is  divided  into  six  banks,  each  of 
24-bit  address  space.  Each  bank  can  be  directly  attached  to  a  peripheral  device.  The 
control  registers  associated  with  the  region  give  extra  flexibility  in  specifying  the 
timing  for  signal  pins  connecting  the  microcontroller  and  PIA  peripherals.  The  PIA 
device-enable  and  control  signals  are  again  provided  on-chip  rather  than  in  external 
support  circuitry. 

When  external  DMA  is  utilized,  transfer  of  data  is  always  between  DRAM  or 
ROM  space  and  PIA  space.  More  on  DMA  follows. 

1.8.5  DMA  Controller 

When an off-chip device wishes to gain access to the microcontroller DRAM, it
makes use of the Direct Memory Access (DMA) Controller. On-chip peripherals can
also perform DMA transfers; this is referred to as internal DMA. DMA is initiated by
an external or internally generated peripheral DMA request.

The  only  internal  peripherals  which  can  generate  DMA  requests  are  the  parallel 
port,  the  serial  port  and  the  video  interface.  These  three  devices  are  described  shortly. 
There are two external DMA request pins, one for each of the two on-chip DMA
control units. Internal peripherals have a control register field which specifies which
DMA  controller  their  DMA  request  relates  to. 

The DMA controllers must be initialized by software before data transfer from,
or to, DRAM takes place. The associated control registers specify the DRAM start
address and the number of transfers to take place. Once the DMA control registers
have been prepared, a DMA transfer will commence immediately upon request
without any further CPU intervention. Once the DMA transfer is complete the DMA
controller may generate an interrupt. The processor may then refresh the DMA control
unit parameters for the next expected DMA transfer.

One of the DMA control units has the special feature of having a duplicate set of
DMA parameter registers. At the end of a DMA transfer, when the primary set of
DMA parameter registers has been exhausted, the duplicate set is immediately
copied into the primary set. This means the DMA unit is instantly refreshed and
prepared for a further DMA request. Ordinarily the DMA unit is not ready for further use
until the support software has executed, usually via an end-of-DMA interrupt request.
Just such an interrupt may be generated, but it will now be concerned with preparing
parameters for the duplicate control registers for the one-after-next DMA request.
This DMA queue technique is very useful when DMA transfers are occurring to the
video controller. In such a case DMA cannot be postponed, as video imaging
requirements mean data must be available if image distortion is to be avoided.
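
The queueing behaviour of the duplicated parameter registers can be modelled as a two-entry queue. The Python sketch below is illustrative only: the class, its methods and the parameter tuple are hypothetical, and the real registers are written individually by software rather than through a single `load` call.

```python
class QueuedDmaChannel:
    """Hypothetical model of a DMA unit with duplicate parameter registers."""

    def __init__(self):
        self.primary = None      # (start_address, transfer_count) in use
        self.duplicate = None    # parameters queued for the next transfer

    def load(self, start, count):
        """Software writes parameters; they queue if a set is already active."""
        if self.primary is None:
            self.primary = (start, count)
        elif self.duplicate is None:
            self.duplicate = (start, count)
        else:
            raise RuntimeError("both parameter sets are in use")

    def transfer_complete(self):
        """End of transfer: the duplicate set is copied into the primary set.

        Returns True if the channel is immediately ready for another request.
        """
        self.primary, self.duplicate = self.duplicate, None
        return self.primary is not None


ch = QueuedDmaChannel()
ch.load(0x40000000, 256)        # parameters for the current transfer
ch.load(0x40000400, 256)        # queued for the next request
ready = ch.transfer_complete()  # True: no software delay before the next request
ch.load(0x40000800, 256)        # interrupt handler queues the one-after-next
```

Without the duplicate set, `transfer_complete` would leave the channel idle until the interrupt handler had run, which is the delay the video interface cannot tolerate.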

External  DMA  can  only  occur  between  DRAM  or  ROM  space  and  two  of  the  six 
PIA address space banks. DMA only supports an 8-bit address field within a PIA
address bank.

One further note on DMA: the microcontroller does support an external DMA
controller, enabling random access by the external DMA device to DRAM and
ROM.  The  external  DMA  unit  must  activate  the  associated  control  pins  and  place  the 
address  on  the  microcontroller  address  bus.  In  conjunction  with  the  microcontroller, 
the  external  DMA  unit  must  complete  the  single  32-bit  data  access. 

1.8.6  16-bit  I/O  Port 

The  I/O  port  supports  bit  programmable  access  to  16  input  or  output  pins.  These 
pins  can  also  be  used  to  generate  level-sensitive  or  edge-sensitive  interrupts.  When 
used  as  outputs,  they  can  be  actively  driven  or  used  in  open  collector  mode. 

1.8.7  Parallel  Port 

The parallel port is intended for connecting the microcontroller chip to a host
processor, where the controller acts as an intelligent high performance control unit.
Data can be transferred in both directions, either via software controlled 8-bit or
32-bit data words, or via DMA unit control. Once again the associated control
registers give the programmer flexibility in specifying the timing requirements for
connecting the parallel port directly to the host processor.

1.8.8  Serial  Port 

The on-chip serial port supports high speed full duplex, bi-directional data
transfer using the RS-232 protocol. The serial port can be used in a polled or
interrupt-driven mode. Alternatively, it may request DMA access. The lightweight
interrupt structure of the Am29000 processor core, coupled with the smart on-chip
peripherals, presents the software engineer with a wide range of options for controlling
the serial port.

1.8.9  I/O  Video  Interface 

The  video  interface  provides  direct  connection  to  a  number  of  laser-beam 
marking  engines.  It  may  also  be  used  to  receive  data  from  a  raster  input  device  such  as 
a scanner or to serialize/deserialize a data stream. With external circuitry support,
a noninterleaved composite TV video signal could be generated.

The  video  shift  register  clock  must  be  supplied  on  an  asynchronous  input  pin, 
which  may  be  tied  to  the  processor  clock.  (Note,  a  video  image  is  built  by  serially 
clocking the data in the shift register out to the imaging hardware. When the shift
register is empty it must be quickly refilled before the next shift clock occurs.) The
imaged  page  may  be  synchronized  to  an  external  page-sync  signal.  Horizontal  and 
vertical  image  margins  as  well  as  image  scan  rates  are  all  programmable  via  the  now 
familiar  on-chip  control  register  method. 

The  video  shift  registers  are  duplicated,  much  like  some  of  the  DMA  control 
registers.  This  reduces  the  need  for  rapid  software  response  to  maintain  video  shift 
register update. When building an image, the shift register is updated from the
duplicate support register. Software, possibly activated via a video-register-empty
interrupt, must fill the duplicate shift register before it becomes used-up. Alternatively,
the  video  data  register  can  be  maintained  by  the  DMA  controller  without  the  need  for 
frequent  CPU  intervention. 

1.8.10  The  SA29200  Evaluation  Board 

The SA29200 is an inexpensive software development board utilizing the
Am29200 microcontroller. Only a 5 V supply and a serial cable connection to a host
computer are required to enable board operation. Included on the board is an 8-bit
wide EPROM (128Kx8) which contains the MiniMON29K debug monitor and the
OS-boot operating system. There is also 1M byte of 32-bit DRAM (256Kx32) into
which programs can be loaded via the on-chip UART. The processor clock rate is 16
MHz and the DRAM operates with 4-cycle initial access and 2-cycle subsequent
burst accesses. So, although the performance is good, it is not as high as other
members of the 29K family.

The  SA29200  board  measures  about  3  by  3.5  inches  (9x10  cm)  and  has  connec¬ 
tions  along  both  sides  which  enable  attachment  to  an  optional  hardware  prototyping 
board (see following section). This extension board has additional I/O interface
devices and a small wire-wrap area for inclusion of application specific hardware.

1.8.11  The  Prototype  Board 

The prototyping board is inexpensive because it contains mainly sockets, which
can  support  additional  memory  devices,  and  a  predrilled  wire-wrap  area.  The  RISC 
microcontroller  signals  are  made  available  on  the  prototyping  board  pins.  Some  of 
these  signals  are  routed  to  the  empty  memory  sockets  so  as  to  enable  simple  memory 
expansions  for  8-bit,  16-bit  or  32-bit  EPROM  or  SRAM.  There  is  also  space  for  up 
to  16M  bytes  of  32-bit  DRAM. 

Using the wire-wrap area the microcontroller I/O signals can be connected to
devices  supporting  specific  application  tasks,  such  as  A/D  conversion  or  peripheral 
control.  This  makes  the  board  ideal  for  a  student  project.  Additionally,  the  access 
times  for  memory  devices  are  programmable,  thus  enabling  the  effects  of  memory 
performance  on  overall  system  operation  to  be  evaluated. 

1.8.12  Am29200  Evaluation 

The combination of the GNU tool chain and the low cost SA29200 evaluation
board and associated prototyping board makes available an evaluation environment
for the industry’s leading embedded RISC. The cost of getting started with embedded
RISC is very low and additional high performance products can be selectively
purchased from specialized tool builders. The evaluation package should be of
particular interest to university undergraduate and post-graduate courses studying RISC.

1.8.13  The  Am29205  Microcontroller 

The  Am29205  is  a  microcontroller  member  of  the  29K  family  (see  Table  1-3).  It 
is  functionally  very  similar  to  the  Am29200  microcontroller.  It  differs  as  a  result  of 
reduced system interface specifications. This reduction enables a lower device pin
count and packaging cost. The Am29205 is available in a 100-lead Plastic Quad Flat
Pack (PQFP) package. It is suitable for use in price sensitive systems which can
operate with the somewhat reduced on-chip support circuitry.

The reduction in pin count results in a 16-bit data/instruction bus. The processor
generates two consecutive memory requests to access instructions and data larger
than 16 bits. The memory system interface has also been simplified in other ways.
Only 16-bit transfers to memory are provided for; no 8-bit ROM banks are
supported. The parallel port, DMA controller, and PIA also now support transfers
limited to the 16-bit data width.

Generally the number of service support pins has been reduced. This applies to the
programmable Input/Output pins (now 8, compared with 16 for the Am29200
processor); the serial communication handshake signals DTR and DSR; the DMA request
signals; the interrupt request pins; and the number of decoded PIA and memory banks.
The signal pins supporting video-DRAM and burst-mode ROM access have also been
omitted. These omissions do not greatly restrict the suitability of the Am29205
microcontroller for many projects. The need to make two memory accesses to fetch
instructions, which are not supported by an on-chip cache memory, will result in
reduced performance. However, many embedded systems do not require the full speed
performance of a 32-bit RISC processor.

AMD provides a low cost evaluation board known as the SA29205. The board is
standalone and very like the SA29200 evaluation board; in fact, it will fit with the
same prototype expansion board used by the SA29200. It is provided with a 256K
byte EPROM, organized as 128Kx16 bits. The EPROM memory is socket upgradable
to 1M byte. There is 512K byte of 16-bit wide DRAM. For debugging purposes, it
can use the MiniMON29K debug monitor utilizing the on-chip serial port.

1.9  THE  Am29240  MICROCONTROLLER 

The  Am29240  is  a  follow-on  to  the  Am29200  microcontroller  (see  Table  1-4). 
It  was  first  introduced  in  1993.  The  Am29240  is  a  member  of  the  Am2924x  family 
grouping  which  offers  increased  on-chip  support  and  greater  processing  power.  In 
terms of peripherals the Am29240 has two serial ports instead of the Am29200’s one.
It also has four DMA controllers instead of two.

Unlike  the  Am29200,  all  of  the  Am29240  DMA  channels  support  queued  data 
transfer.  Additionally,  fly-by  DMA  transfers  are  optionally  supported.  Normal 
DMA transfers require a read stage followed by a write stage. The data being
transferred is temporarily held in an on-chip buffer after being read. With fly-by DMA the
read  and  write  stages  occur  at  the  same  time.  This  results  in  a  faster  DMA  transfer. 
However,  the  device  being  accessed  must  be  able  to  transfer  data  at  the  maximum 
DRAM  access  rate. 

The Am2924x family grouping, unlike the Am2920x grouping, supports virtual
memory addressing. The Translation Look-Aside Buffer (TLB) used to construct an
MMU scheme supports larger page sizes than the Am29000 processor. The page size
can be up to 16M bytes. The large page size enables extensive memory regions to be
mapped with only a few TLB mapping entries. For this reason only 16 TLB entries
are provided (8 sets, two entries per set). A consequence of the relatively large page
size is that pages cannot be individually protected against Supervisor mode reads and
execution; this is possible with the smaller pages used by the Am29000 processor
(see section 6.2.1). This loss is outweighed by the benefits of the larger page size

Table 1-4. Am2924x Microcontroller Members of 29K Processor Family

Processor                   Am29240               Am29243               Am29245
--------------------------  --------------------  --------------------  --------------------
Instruction Cache           4K bytes              4K bytes              4K bytes
I-Cache Associativity       2-Way                 2-Way                 2-Way
Data Cache (Physical)       2K bytes              2K bytes              -
D-Cache Associativity       2-Way                 2-Way                 -
On-Chip Floating-Point      No                    No                    No
On-Chip MMU                 Yes                   Yes                   Yes
Integer Multiply in h/w     Yes, 1-cycle          Yes, 1-cycle          No
Programmable I/O            16 pins               8 pins                8 pins
ROM width                   8/16/32 bit           16/32 bit             8/16/32 bit
DRAM width                  16/32 bit             8/16/32 bit (parity)  16/32 bit
On-Chip Interrupt           Yes                   Yes                   Yes
Controller Inputs           14                    14                    14
Scalable Clocking           1x,2x                 1x,2x                 No
Burst-mode Addressing       Yes, up to 1K bytes   Yes, up to 1K bytes   Yes, up to 1K bytes
Freeze Mode Processing      Yes                   Yes                   Yes
Delayed Branching           Yes                   Yes                   Yes
On-Chip Timer               Yes                   Yes                   Yes
On-Chip Memory Controller   Yes                   Yes                   Yes
DMA Channels                4                     4                     2
Byte Endian                 Big                   Big                   Big
Serial Ports                2                     2                     1
JTAG Debugging              Yes                   Yes                   Yes
Clock Speeds (MHz)          0-20,25,33            0-20,25,33            0-16

42 


Evaluating  and  Programming  the  29K  RISC  Family 


which  achieves  virtual  memory  addressing  with  little  TLB  reload  activity  and  with 
only  a  small  amount  of  chip  area  being  required. 
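
The TLB figures quoted above can be checked with a little arithmetic. The
following Python sketch is illustrative only: the function name tlb_reach is
hypothetical, and it simply multiplies the number of TLB entries by the page
size to show how much memory can be mapped before a TLB reload is needed.

```python
MB = 1 << 20  # one megabyte, in bytes


def tlb_reach(sets: int, ways: int, page_bytes: int) -> int:
    """Return the number of bytes that can be mapped at one time."""
    return sets * ways * page_bytes


# 8 sets, two entries per set, with the maximum 16M byte page size.
reach = tlb_reach(sets=8, ways=2, page_bytes=16 * MB)
print(reach // MB, "M bytes mapped by 16 entries")
```

With the maximum page size, the 16 entries cover 256M bytes, which is why so
few entries are sufficient.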

Increased performance is achieved by the inclusion of separate 4k byte
instruction and 2k byte data caches. As with all 29K instruction caches, address
tags are based on virtual addresses when address translation is turned on. The
first processor in the 29K Family to have a conventional instruction cache was the
Am29030. The Am29240 cache is similar in operation to the Am29030's cache.
However, the Am29240 processor has four valid bits per cache entry (four
instructions) in place of the previous one bit. This offers a performance advantage
as cache blocks need only be partially filled and need not be fetched according to
block boundaries (more on this in section 5.13.5).

The data cache always operates with physical addresses. The block size is 16
bytes and there is one valid bit per block. This means that complete data blocks must
be fetched when data cache reload occurs. A "write-through" policy is supported by
the cache, which ensures that external memory is always consistent with cache
contents. Cache blocks are only allocated for data loaded from DRAM or ROM address
regions. Access to other address regions is not cached. A two word write-through
buffer is used to assist with writes to memory. It enables multiple store instructions to
be in execution without the processor pipeline stalling. Data accesses which hit in
the cache require 1-cycle access times. The data cache operation is explained in
detail in section 5.14.
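
The cache dimensions given above (2K bytes, 2-way, 16-byte blocks) imply 64
sets. The Python sketch below shows how a physical address would split into
tag, set index and byte offset under a conventional set-associative layout.
That layout is an assumption; this section does not state which address bits
the Am29240 uses, and the name split_address is hypothetical.

```python
CACHE_BYTES = 2 * 1024  # data cache size, from the text
WAYS = 2                # 2-way set associative, from Table 1-4
BLOCK_BYTES = 16        # block size, from the text
SETS = CACHE_BYTES // (WAYS * BLOCK_BYTES)  # 64 sets


def split_address(addr: int):
    """Split a physical address into (tag, set index, byte offset).

    Assumes the low-order bits select the byte within the block and the
    next bits select the set, as in a conventional cache.
    """
    offset = addr % BLOCK_BYTES
    index = (addr // BLOCK_BYTES) % SETS
    tag = addr // (BLOCK_BYTES * SETS)
    return tag, index, offset


print(SETS, split_address(0x12345678))
```

Because there is only one valid bit per block, a miss on any byte of a block
means all 16 bytes sharing that tag and index are fetched.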

Scalable bus clocking is supported, enabling the processor to run at twice the
speed of the off-chip memory system. Scalable Clocking was first introduced with
the Am29030 processors, and is described in the earlier section on the Am29030.
If cache hit rates are sufficiently high, Scalable Clocking enables high
performance systems to be built around relatively slow memory systems. It also
offers an excellent upgrade path when additional performance is required in the
future.

Initially the ROM memory region is assumed to have four cycle access times
(three wait states) and no burst-mode, the same as the Am29200. The four banks
within the region can be programmed for zero wait-state read and one wait-state
write, or another combination suitable for slower memory devices.

DRAM, unlike ROM, is always assumed to have 3-cycle access times. However,
if page-mode DRAM is used it is possible to achieve 1-cycle burst-mode
accesses. Burst-mode is used when consecutive memory addresses are being accessed,
such as during instruction fetching. The Am29200 microcontroller supports 4-cycle
DRAM access with 2-cycle burst. The faster DRAM interface of the Am29240
should result in a substantial performance gain. Additionally, the 3-cycle initial
DRAM access can be reduced to 2-cycle if the required 1-cycle precharge can be
hidden. This is explained in section 1.14.1 under the Am29200 and Am29205
subheading. Consequently the Am29240 DRAM is often referred to as 2/1 rather than
3/1.
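
The effect of these access times can be seen by counting cycles for a short
burst. The following Python sketch is a simplified model, not a description of
the DRAM controller: it assumes every word after the first is a burst access
within the same DRAM page and ignores refresh. The function name
access_cycles is hypothetical.

```python
def access_cycles(first: int, burst: int, words: int) -> int:
    """Cycles to read `words` consecutive words: one initial access
    followed by burst accesses for the remaining words."""
    if words < 1:
        raise ValueError("at least one word must be transferred")
    return first + burst * (words - 1)


for name, first, burst in (
    ("Am29200 4/2", 4, 2),
    ("Am29240 3/1", 3, 1),
    ("Am29240 2/1", 2, 1),  # precharge hidden
):
    print(name, access_cycles(first, burst, words=4), "cycles for 4 words")
```

For a four-word burst this gives 10 cycles for the Am29200 against 6 (or 5
with hidden precharge) for the Am29240.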

The Am29240 processor supports integer multiply directly in a single cycle.
Most 29K processors take a trap when an integer multiply is attempted. It is left to
trapware to emulate the missing instruction. The ability to perform high speed
multiply makes the processor a better choice for calculation intensive applications
such as digital signal processing. Note, floating-point performance should also
improve with the Am29240 as floating-point emulation routines can make use of the
integer multiply instruction.
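
To give a feel for the work an emulation routine must do when there is no
hardware multiplier, the Python sketch below uses the generic shift-and-add
technique to form the low 32 bits of a product. It is not the 29K trapware,
which is written in assembly language and is not shown in this section; the
function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def multiply_shift_add(a: int, b: int) -> int:
    """Return the low 32 bits of a * b using only shifts and adds."""
    a &= MASK32
    b &= MASK32
    product = 0
    while b:
        if b & 1:                       # add the multiplicand for each 1 bit
            product = (product + a) & MASK32
        a = (a << 1) & MASK32           # next bit is worth twice as much
        b >>= 1
    return product


print(multiply_shift_add(6, 7))
```

The loop can run up to 32 times, which illustrates why a 1-cycle hardware
multiply is attractive for calculation intensive code.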

The Am2924x family grouping is implemented with a silicon process which
enables processors to operate at 3.3-volts or 5-volts. The lower power consumption
achievable at 3.3-volts makes the Am29240 suitable for hand-held type
applications.

1.9.1  The  Am29243  Microcontroller 

The Am29243 is an Am29240 microcontroller enhanced to deal with
communication applications (see Table 1-4). For this reason the video interface is
omitted. The pins used have not been reassigned, and there is a possibility they will
be allocated in a future microcontroller for an additional communications support
function.

Communication applications frequently require large amounts of DRAM, and it
is often critical that no corruption of the data occur. Parity error checking is often
performed by memory systems with the objective of detecting data corruption. It can
be difficult to build the necessary circuitry at high memory system speeds. The
Am29243 microcontroller has built-in parity generation and checking for all DRAM
accesses. When enabled by the DRAM controller, the processor will take trap
number 4 when a parity error is detected. Having parity handling built in enables
single-cycle DRAM accesses to be performed without any external circuitry required.
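
The idea behind parity checking can be sketched in a few lines of Python. The
example assumes one parity bit per byte of a 32-bit word (four bits in total)
and even parity; the polarity actually used by the Am29243 is not stated here,
so treat both function names and the parity sense as assumptions.

```python
def byte_parity_bits(word: int) -> list:
    """Return four parity bits, one per byte, least significant byte first.

    Even parity is assumed: each bit is 1 when its byte holds an odd
    number of ones, so byte plus parity bit always has an even count.
    """
    bits = []
    for lane in range(4):
        byte = (word >> (8 * lane)) & 0xFF
        bits.append(bin(byte).count("1") & 1)
    return bits


def parity_error(word: int, stored_bits: list) -> bool:
    """True when the word read back disagrees with the stored parity."""
    return byte_parity_bits(word) != stored_bits


stored = byte_parity_bits(0x12345678)          # generated on the write
print(parity_error(0x12345678, stored))        # clean read
print(parity_error(0x12345679, stored))        # one bit corrupted
```

A single corrupted bit changes the parity of its byte and is detected; two
corrupted bits in the same byte would go unnoticed, which is the usual
limitation of simple parity.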

Because of the larger amounts of memory typically used in communication
applications, the Am29243 has a second Translation Look-Aside Buffer (TLB).
Having two TLBs enables a larger number of virtual to physical address translations
to be cached (held in a TLB register) at any time. This reduces the TLB reload
overhead. The second TLB also has 16 entries (8 sets, two entries per set), and the
page size can be the same or different. If the TLB page sizes are the same, a four-way
set associative MMU can be constructed with supporting software. Alternatively one
TLB can be used for code and the second, with a larger page size, for data buffers or
shared libraries. The TLB entries have a Global Page (GLB) bit; when set, the mapped
page can be accessed by any process regardless of its process identifier (PID).

1.9.2  The  Am29245  Microcontroller 

The Am29245 is a low-cost version of the Am29240 microcontroller (see
Table 1-4). To enable the lower cost, the data cache and the integer multiply unit have
been omitted. Further, there are only two DMA channels in place of the Am29240's
four. To further reduce cost, one of the two serial ports has also been omitted.

The Am29245 is intended for use in systems which do not need the maximum
performance of the Am29240 or all of its peripherals, and which can benefit from a
reduced processor cost. The Am29245 does not support Scalable Clocking and is only
available at relatively low clock speeds.

1.9.3  The  Am2924x  Evaluation 

AMD has a number of boards available for Am2924x evaluation.
Microcontrollers in this family grouping all have the same pin configuration. This
enables the boards to operate with any of the Am2924x processors. The least
expensive board is the SD29240. It is a very small board, similar in form to the
SA29200 board, but it does not have the expansion connector available with the
SA29200. It is normally supplied with an Am29240 or Am29245 installed. There is
1M byte of 32-bit wide DRAM which operates at 16 MHz. When an Am29240 is used,
Scalable Clocking can enable the processor to operate at 32 MHz. The board also has
a JTAG and RS-232 connector. The 1M byte of 32-bit wide EPROM supplied with the
board is preprogrammed for MiniMON29K operation.

Those  with  more  money  to  spend,  or  requiring  a  more  flexible  evaluation  board, 
can  use  the  SE29240  board.  It  contains  an  Am29243  processor  but  can  be  used  to 
evaluate  an  Am29240  or  Am29245.  Initially  the  board  contains  1M  byte  of  36-bit 
wide  DRAM.  However,  this  can  be  expanded  considerably.  The  DRAM  is  36-bits 
wide  due  to  the  additional  4-bits  required  for  parity  checking.  The  maximum 
memory  speed  is  25  MHz.  Scalable  Clocking  can  be  used  with  a  32  MHz  processor 
when  the  memory  system  is  configured  for  16  MHz  operation. 

The SE29240 board has greater I/O capability than the SD29240 board. There
are connectors for two RS-232 ports and a parallel port. Debugging can be achieved
via a serial or parallel port link to the MiniMON29K DebugCore located in EPROM.
Debugging is also supported via the JTAG or Logic Analyzer connections. There is a
small wire-wrap area for additional circuitry, and extra boards can be connected via
an expansion connector.

AMD also has an evaluation board intended for certain communication
applications. The NET29K board has a triple processor pad-site. The board can
operate with either an Am29205, Am29200 or Am2924x (probably an Am29243)
processor. The processor pad site is concentric, the larger processor being at the
outer position. The similarity in the memory region controllers enables the
construction of this unusual board.

The memory system consists of 4M bytes of 36-bit wide DRAM, which is
expandable. There is also 2M bytes of 32-bit EPROM. The EPROM can be replaced
with 1M byte of Flash programmable memory. For communications there is an AMD
MACE chip which provides an Ethernet capability via a 10-Base-T connector. Two
of the processor's DMA channels are wired for MACE access. One channel of an
85C30 UART is connected to an RS-449 connector which supports RS-422 signal
level communication. This enables very fast UART communication. The
MiniMON29K DebugCore and OS-boot operating system are initially installed in
EPROM (or Flash), and the DebugCore communicates via an on-chip UART
connected to an RS-232 (9-way) connector.

When the NET29K board is used with an Am29205 processor, the 16-bit
processor bus enables only half of the memory system to be accessed. The board is
physically small, measuring about 5 1/2 x 5 1/2 inches (14cm x 14cm). Debugging is
further supported by JTAG and Logic Analyzer connections. An inexpensive 9-volt
power supply is required.

1.10 REGISTER AND MEMORY SPACE

Most of the 29K instructions operate on information held in various processor
registers. Load and store type instructions are available for moving data between
external memory and processor registers. Members of the 29K family generally
support registers in three independent register regions which make up the 29K
register space. These regions are the General Purpose registers, Translation
Look-Aside Buffer (TLB) registers, and Special Purpose registers. Members of the 29K
family which do not support Memory Management Unit operation do not have TLB
registers implemented.

There are currently two core processors within the 29K family, the Am29000
and the Am29050. Other processors are generally derived from one of these core
processors. For example, the Am29030 has an Am29000 at its core, with additional
silicon area being used to implement instruction cache memory and a 2-bus processor
interface. The differences between the core processors and their derivatives are
reflected in expansions to the special register space.

However, the special register space does appear uniform throughout the 29K
family. Generally only those concerned with generating operating system support
code are concerned with the details of the special register space. AMD has specified a
subset of special registers which are supported on all 29K family processors. This
aids in the development and porting of Supervisor mode code.

The core processors support a 3-bus Harvard Architecture, with instructions
and data being held in separate external memory systems. There is one 32-bit bus
each for the two memory systems and a shared 32-bit address bus. Some RISC chips
have a 4-bus system, where there is an address bus for each of the two memory
systems. This avoids the contention for use of a shared address bus. Unfortunately, it
also results in increased pin-count and, consequently, processor cost. The 29K 3-bus
processors avoid conflicts for the address bus by supporting burst mode addressing
and a large number of on-chip registers. It has been estimated that the Am29000
processor loses only 5% performance as a result of the shared address bus.

All instruction fetches are directed to instruction memory; data accesses are
directed to data memory or I/O space. These two externally accessible spaces
constitute two of the four external access spaces. The other two are the ROM space
and the coprocessor space. The ROM space is accessed via the instruction bus. Like
the instruction space it covers a 2^32 range.

1.10.1  General  Purpose  Registers 

All members of the family have general purpose registers which are made up
from 128 local registers and more than 64 global registers (see Figure 1-14). These
registers are the primary source and destination for most 29K instructions.
Instructions have three 8-bit operand fields which are used to supply the addresses of
general registers. All User mode executable instructions and code produced by high
level language compilers are restricted to only directly accessing general purpose
registers. The fact that these registers are all 32-bit and that there is a large number of
them, vis-a-vis CISC, reduces the need to access data held in external memory.

General purpose registers are implemented by a multiport register file. This file
has a minimum of three access ports; the Am29050 processor has an additional port
for writing back floating-point results. Two of the three ports provide simultaneous
read access to the register file; the third port is for updating a register value.
Instructions generally specify two general purpose register operands which are to be
operated on. After these operands have been presented to the execution unit, the
result of the operation is made available in the following cycle. This allows the result
of an integer operation to be written back to the selected general purpose register in
the cycle following its execution. At any instant, the current cycle is used to write
back the result of the previous computation.

The Am29050 can execute floating-point operations in parallel with integer
operations. The latency of floating-point instructions can be more than the 1-cycle
achieved by the integer operation unit. Floating-point results are written back, when
the operation is complete, via their own write-back port, without disrupting the
integer unit's ability to write results into the general purpose register file.

Global  Registers 

The 8-bit operand addressing fields enable only the lower 128 of the possible
256 address values to be used for direct general purpose register addressing. This is
because the most significant address bit is used to select a register base-plus-offset
addressing mode. When the most significant bit is zero, the accessed registers are
known as Global Registers. Only the upper 64 of the global registers are
implemented in the register file. These registers are known as gr64-gr127. Some of
the lower address-value global registers are assigned special support tasks and are
not really general purpose registers.

The Am29050 processor supports a condition code accumulator with global
registers gr2 and gr3. The accumulator can be used to concatenate the result of
several Boolean comparison operations into a single condition code. Later the
accumulated condition can be quickly tested. These registers are little used and on
the whole other, more efficient, techniques can be found in preference to their use.

Local  Registers 

When the most significant address bit is set, the upper 128 registers in the
general purpose register file are accessed. The lower 7-bits of the address are used as
an offset to a base register which points into the 128 registers. These general purpose
registers are known as the Local Registers. The base register is located at the global
register address gr1. If the addition of the 7-bit operand address value and the register
base value produces a result too big to be contained in the 7-bit local register address
space, the result is rounded modulo-128. When producing a general purpose register
address from a local register address, the most significant bit of the general purpose
register address value is always set.

The local register base address can be read by accessing global register gr1.
However, the base register is actually a register which shadows global register gr1.
The shadow support circuitry requires that the base be written via an ALU operation
producing a result destined for gr1. This also requires that a one-cycle delay separate
the setting of the base register from any subsequent reference to local registers.

Global register address gr0 also has a special meaning. Each of the three
operand fields has an indirect pointer register located in the special register space.
When address gr0 is used in an operand field, the indirect pointer is used to access a
general purpose register for the associated operand. Each of the three indirect
pointers has an 8-bit field and can point anywhere in the general purpose register
space. When indirect pointers are used, there is no distinction made between global
and local registers.
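
The three addressing cases described above (global, local and indirect) can be
summarized in a short Python sketch. It is a model of the rules in the text,
not of the hardware: the function name is hypothetical, and local_base is
taken to be the 7-bit register number held by the base register, since the
way gr1 encodes that value is not covered here.

```python
def resolve_operand(field: int, local_base: int, indirect_pointer: int) -> int:
    """Map an 8-bit operand field to an absolute general register number."""
    if not 0 <= field <= 0xFF:
        raise ValueError("operand field must fit in 8 bits")
    if field == 0:
        # gr0: the indirect pointer for this operand supplies the register.
        return indirect_pointer & 0xFF
    if field & 0x80:
        # Local register: offset plus base, wrapped modulo 128, with the
        # most significant bit of the result always set.
        return 0x80 | ((local_base + (field & 0x7F)) & 0x7F)
    # Global register: the field is the register number itself.
    return field


print(resolve_operand(0x60, local_base=5, indirect_pointer=0))    # gr96
print(resolve_operand(0x8A, local_base=120, indirect_pointer=0))  # wraps
print(resolve_operand(0x00, local_base=5, indirect_pointer=200))  # indirect
```

The second call shows the modulo-128 wrap: offset 10 from base 120 lands on
absolute register 130 rather than running off the end of the register file.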

All of the general purpose registers are accessible to the processor while
executing in User mode unless register bank protection is applied. General purpose
registers starting with gr64 are divided into groups of 16 registers. Each group can
have access restricted to the processor operating in Supervisor mode only. The AMD
high level language calling convention specifies that global registers gr64-gr95 be
reserved for operating system support tasks. For this reason it is normal to see the
special register used to support register banking set to disable User mode access to
global registers gr64-gr95.
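
A protection mask for gr64-gr95 can be worked out as below. The Python sketch
assumes that bit n of the bank protection register covers absolute register
numbers 16n to 16n+15, so that gr64-gr79 and gr80-gr95 fall in banks 4 and 5.
That bit numbering is an assumption and should be checked against the
processor User's Manual; the function names are hypothetical.

```python
BANK_SIZE = 16  # registers per protection group, from the text


def rbp_mask(first_reg: int, last_reg: int) -> int:
    """Return a mask with one bit set for every bank in the range."""
    mask = 0
    for bank in range(first_reg // BANK_SIZE, last_reg // BANK_SIZE + 1):
        mask |= 1 << bank
    return mask


def user_access_allowed(rbp: int, reg: int) -> bool:
    """True when User mode code may access absolute register `reg`."""
    return ((rbp >> (reg // BANK_SIZE)) & 1) == 0


os_mask = rbp_mask(64, 95)
print(hex(os_mask), user_access_allowed(os_mask, 96))
```

Under this assumption the mask is 0x30, and registers from gr96 upward remain
available to User mode code.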

1.10.2  Special  Purpose  Registers 

Special purpose register space is used to contain registers which are not
accessed directly by high level languages. Registers such as the program counter and
the interrupt vector table base pointer are located in special register space. Normally
these registers are accessed by operating system code or assembly language helper
routines. Special registers can only be accessed by move-to and move-from type
instructions. Except for the move-to-immediate case, these instructions require the
use of a general purpose register. It is worth noting that move-to special register
instructions are among a small group of instructions which cause processor
serialization. That is, all outstanding operations, such as overlapping load or store
instructions, are completed before the serializing instruction commences.

Special register space is divided into two regions (see Figure 1-15). Those
registers whose address is below sr128 can only be accessed by the processor
operating in Supervisor mode. Different members of the 29K family have extensions
to the special registers shown in Figure 1-15. However, special registers sr0-sr14 are
a subset which appear in all family members. Certain, generally lower cost, family
members such as the Am29005 processor, which have no memory management unit,
do not have the relevant MMU support registers (sr13 and sr14). I shall first describe
the restricted access, or protected, special registers. I shall not go into the exact
bit-field operations in detail; for an expansion of field meanings see later chapters or
the relevant processor User's Manual. The objective here is to provide a framework
for better understanding the special register space.

Special registers are not generally known by their special register number. For
example, the program counter buffer register PC1 is known as PC1 by assembly
language programming tools rather than sr11.

Vector  Area  Base 

Special register sr0, better known as VAB, is a pointer to the base of a table of
256 address values. Each interrupt or trap is assigned a unique vector number. Vector
numbers 0-63 are assigned to specific processor support tasks. When an interrupt or
trap exception is taken, the vector number is used to index the table of address values.
The identified address value is read and used as the start address of the exception
handling routine. Alternatively, with 3-bus members of the 29K family, the vector
table can contain 256 blocks of instructions. The VF bit (vector fetch) in the
processor Configuration register (CFG) is used to select the vector table
configuration. Each block is limited to 64 instructions, but via this method the
interrupt handler can be reached faster as the start of, say, an interrupt handler need
not be preceded by a fetch of the address of the handler. In practice the table of
vectors to handlers, rather than handlers themselves, is predominantly used due to
the more efficient use of memory. For this reason the two later 2-bus members of the
29K family only support the table of vectors method; and the VF bit in the CFG
register is reserved and effectively set.
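
The two vector table layouts lead to different address calculations, sketched
below in Python. The sketch assumes 4-byte table entries and 4-byte
instructions, so that a block of 64 instructions occupies 256 bytes, and
takes VF set to mean the table of vectors method. The function name is
hypothetical.

```python
ENTRY_BYTES = 4                   # one 32-bit handler address per vector
BLOCK_BYTES = 64 * 4              # 64 instructions of 4 bytes each


def vector_fetch_address(vab: int, vector: int, vf: int = 1) -> int:
    """Return the address used when exception `vector` is taken.

    vf=1: address of the table entry holding the handler's start address.
    vf=0: address of the first instruction of the handler block itself.
    """
    if not 0 <= vector <= 255:
        raise ValueError("vector number must be in the range 0-255")
    step = ENTRY_BYTES if vf else BLOCK_BYTES
    return vab + step * vector


print(hex(vector_fetch_address(0x10000, 4, vf=1)))
print(hex(vector_fetch_address(0x10000, 4, vf=0)))
print(256 * ENTRY_BYTES, 256 * BLOCK_BYTES)   # total table sizes in bytes
```

A complete table of vectors needs only 1K bytes, against 64K bytes for a table
of handler blocks, which is the memory saving referred to above.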

Special Purpose
Reg. No.     Protected Registers             Mnemonic

    0        Vector Area Base Address        VAB
    1        Old Processor Status            OPS
    2        Current Processor Status        CPS
    3        Configuration                   CFG
    4        Channel Address                 CHA
    5        Channel Data                    CHD
    6        Channel Control                 CHC
    7        Register Bank Protect           RBP
    8        Timer Counter                   TMC
    9        Timer Reload                    TMR
   10        Program Counter 0               PC0
   11        Program Counter 1               PC1
   12        Program Counter 2               PC2
   13        MMU Configuration               MMU
   14        LRU Recommendation              LRU

             Unprotected Registers

  128        Indirect Pointer C              IPC
  129        Indirect Pointer A              IPA
  130        Indirect Pointer B              IPB
  131        Q                               Q
  132        ALU Status                      ALU
  133        Byte Pointer                    BP
  134        Funnel Shift Count              FC
  135        Load/Store Count Remaining      CR
  160        Floating-Point Environment      FPE
  161        Integer Environment             INTE
  162        Floating-Point Status           FPS

Figure 1-15. Special Purpose Register Space for the Am29000 Microprocessor

The first 29K processor, the Am29000, has a VAB register which requires the
base of the vector table to be aligned to a 64k byte address boundary. This can be
inconvenient and lead to memory wastage. More recent family members provide for a
1k byte boundary. Because the 3-bus family members support instructions being
located in Instruction space and ROM space (memory space is described in section
1.10.4), it is possible with these processors to specify that handler routines are in
ROM space by setting the RV bit (ROM vector area) in the CFG register when the VF
bit is zero. Alternatively, when the more typical table of vectors method is being
used, ROM space is selected by setting bit 1 of the handler address. Since handler
routines all start on 4-byte instruction boundaries, bits 0 and 1 of the vector address
are not required to hold address information. The 2-bus and microcontroller members
of the 29K family do not support ROM space, and the RV bit in the CFG register is
reserved.
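
The use of the two low-order bits of a table entry can be illustrated with a
short Python sketch. It covers only what is described above, namely that bit 1
selects ROM space on 3-bus processors when the table of vectors method is in
use; bit 0 is left uninterpreted. The function name is hypothetical.

```python
ROM_SPACE_BIT = 1 << 1  # bit 1 of the vector table entry


def decode_vector_entry(entry: int):
    """Split a 32-bit vector table entry into (handler address, in ROM)."""
    handler = entry & 0xFFFFFFFC        # bits 0 and 1 carry no address
    in_rom = bool(entry & ROM_SPACE_BIT)
    return handler, in_rom


print(decode_vector_entry(0x00004002))  # handler at 0x4000 in ROM space
print(decode_vector_entry(0x00004000))  # handler at 0x4000 in Instruction space
```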

Processor  Status 

Two special registers, sr1 and sr2, are provided for processor status reporting
and control. The two registers OPS (old processor status) and CPS (current processor
status) have the same bit-field format. Each bit position has been assigned a unique
task. Some bit positions are not effective with particular family members. For
example, the Am29030 processor does not use bit position 15 (CA). This bit is used to
indicate coprocessor activity. Only the 3-bus family members support coprocessor
operation in this way.

The CPS register reports and controls current processor operation. Supervisor
mode code is often involved with manipulating this register as it controls the enabling
and disabling of interrupts and address translation. When a program execution
exception is taken, or an external event such as an interrupt occurs, the CPS register
value is copied to the OPS register and the processor modifies the CPS register to
enter Supervisor mode before execution continues in the selected exception handling
routine. When returning from the handler routine, the interrupted program is
restarted with an IRET type instruction. Execution of an IRET instruction causes the
OPS register to be copied back to the CPS register, helping to restore the interrupted
program context. Supervisor mode code often prepares OPS register contents before
executing an IRET and starting User mode code execution.
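
The interplay between CPS and OPS can be modelled in a few lines of Python.
This is a sketch of the copy operations described above and nothing more: the
class and method names are hypothetical, the SM bit position is a placeholder
rather than the documented one, and other changes the processor makes to CPS
on an exception are omitted.

```python
SM = 1 << 4  # Supervisor mode bit; the position is a placeholder


class ProcessorStatus:
    def __init__(self, cps: int = 0) -> None:
        self.cps = cps  # current processor status
        self.ops = 0    # old processor status

    def take_exception(self) -> None:
        """Interrupt or trap: save CPS in OPS, then enter Supervisor mode."""
        self.ops = self.cps
        self.cps |= SM

    def iret(self) -> None:
        """IRET: copy OPS back to CPS, restoring the interrupted state."""
        self.cps = self.ops

    def start_user_program(self, user_cps: int) -> None:
        """Supervisor code prepares OPS, then uses IRET to drop to User mode."""
        self.ops = user_cps & ~SM
        self.iret()


status = ProcessorStatus(cps=0x1)   # some User mode state
status.take_exception()
print(hex(status.ops), bool(status.cps & SM))
status.iret()
print(hex(status.cps))
```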

Configuration 

Special register sr3, known as the configuration control register (CFG),
establishes the selected processor operation. Such options as big or little endian byte
order, cache enabling, coprocessor enabling, and more are selected by the CFG
setting. Normally this register value is established at processor boot-up time and is
infrequently modified.

The original Am29000 (rev C and later) only used the first six bits of the CFG
register for processor configuration. Later members of the family offer the selection
of additional processor options, such as instruction memory cache and early address
generation. Additional options are supported by extensions to the CFG bit-field
assignment. Because there is no overlap with CFG bit-field assignment across the 29K
family, and family members offer a matrix of functionality, there are often reserved
bit-fields in the CFG register for any particular 29K processor. The function
provided at each bit position is unique and if the function is not provided for by a
processor, the bit position is reserved.

The upper 8-bits of the CFG register are used for processor version and revision
identification. The upper 3-bits of this field, known as the PRL (processor revision
level), identify the processor. The Am29000 processor is identified by processor
number 0, the Am29050 is processor number 1, and so on. The lower 5-bits of the
PRL give the revision level; a value of 3 indicates revision 'D'. The PRL field is
read-only.
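
Extracting the identification fields from a CFG value is a matter of shifting
and masking, as the Python sketch below shows. It assumes a 32-bit CFG
register with the PRL in bits 31-24, and it extrapolates the revision
lettering linearly from the one data point given (3 is 'D', so 0 is taken to
be 'A'). The function name is hypothetical.

```python
def decode_prl(cfg: int):
    """Return (processor number, revision letter) from a CFG value."""
    prl = (cfg >> 24) & 0xFF        # upper 8 bits of the register
    processor = prl >> 5            # upper 3 bits of the PRL
    revision = prl & 0x1F           # lower 5 bits of the PRL
    return processor, chr(ord("A") + revision)


print(decode_prl(0x03000000))   # processor 0 (Am29000), revision D
print(decode_prl(0x23000000))   # processor 1 (Am29050), revision D
```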

Data  Access  Channel 

Three special registers, sr4-sr6, known as CHA (channel address), CHD
(channel data) and CHC (channel control), are used to control and record all access
to external data memory. Processors in the 29K family can perform data memory
access in parallel with instruction execution. This offers a considerable performance
boost, particularly where there is high data memory access latency. Parallel operation
can only occur if the instruction pipeline can be kept fed from the instruction prefetch
buffer (IPB), instruction memory cache, or via separate paths to data and instruction
memory (Harvard style 3-bus processors). It is an important task of a high level
language compiler to schedule load and store instructions such that they can be
successfully overlapped with other nondependent instructions (see section 1.13).

When data memory access runs in parallel, its completion will occur some time
after the instruction originally making the data access. In fact it could be several
cycles after the original request, and it may not be possible to determine the original
instruction. On many processors, keeping track of the original instruction is required
in case the load or store operation does not complete for some reason. The original
instruction is restarted after the interrupting complication has been dealt with.
However, with the 29K family the original instruction is not restarted. All access to
external memory is via the processor Data Channel. The three channel support
registers are used to restart any interrupted load or store operation. Should an
exception occur during data memory access, such as an address translation fault,
memory access violation, or external interrupt, the channel registers are updated by
the processor to report the state of the in-progress memory access.

The channel control register (CHC) contains a number of bit-fields. The
contents-valid bit (CV) indicates that the channel support registers currently describe
a valid data access. The CV bit is normally seen set when a channel operation is
interrupted. The ML bit indicates a load- or store-multiple operation is in progress.
LOADM and STOREM instructions set this bit when commencing and clear it when
complete. It is important to note that non-multiple LOAD and STORE instructions
do not set or clear the ML bit. When a load- or store-multiple operation is interrupted
and nested interrupt processing is supported, it is not sufficient to just clear the CV bit
to temporarily cancel the channel operation. If the ML bit were left set, a subsequent
load or store operation would become confused with a multiple type operation. The
ML bit should be cleared along with the CV bit; this is best done by writing zero into
the CHC register. (See section 4.3.8 for more information about clearing CHC.)
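
The difference between clearing only CV and writing zero to CHC can be modelled
with a few lines of C. This is a sketch operating on a plain 32-bit value, not code
that touches the real register; the bit positions chosen for CV and ML are
assumptions made for illustration and should be checked against the processor's
user manual, and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define CHC_CV (1u << 0)    /* contents valid */
#define CHC_ML (1u << 14)   /* load- or store-multiple in progress */

/* Insufficient: cancels the channel operation but leaves ML untouched. */
static uint32_t chc_clear_cv_only(uint32_t chc)
{
    return chc & ~CHC_CV;
}

/* What the text recommends: write zero, clearing CV and ML together. */
static uint32_t chc_cancel(uint32_t chc)
{
    (void)chc;              /* the previous contents are simply discarded */
    return 0u;
}

/* Would a later non-multiple LOAD or STORE see a stale ML bit? */
static int chc_stale_multiple(uint32_t chc)
{
    return (chc & CHC_ML) != 0;
}
```

Starting from a value with both CV and ML set, chc_clear_cv_only() leaves a value
for which chc_stale_multiple() is true, while chc_cancel() does not.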

Integer operations complete in a single cycle, enabling the result of the previous
integer operation to be written back to the general purpose register file in the current
cycle. Because external memory reads are likely to take several cycles to complete,
and pipeline stalling is to be avoided, the accessed data value is not written back to the
global register file during the following instruction (the write-back cycle). This
results in the load data being held by the processor until access to the write-back port
is available. This is certain to occur during the execution of any future load or store
instruction which itself cannot make use of its own write-back cycle. The processor
makes available via load-forwarding circuitry the load data which awaits write-back
to the register file.

Register  Access  Protection 

Special register sr7, known as RBP (register bank protect), provides a means to
restrict the access of general purpose registers by programs executing in User mode.
General purpose registers starting with gr64 are divided into groups of 16 registers.
When the corresponding bit in the RBP register is set, the associated bank of 16
registers is protected from User mode access. The RBP register is typically used to
prevent User mode programs from accessing Supervisor-maintained information held in
global registers gr64-gr95. These registers are reserved by the AMD high level
language calling convention for system level information.
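
As a rough sketch, the bank arithmetic can be expressed in C. The bit-to-bank
mapping used here is an assumption: bit n of RBP is taken to guard absolute register
numbers 16n to 16n+15, which would place gr64-gr79 in bank 4. The function names
are hypothetical and the code only computes mask values; it does not access the
register itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mapping: bank number = absolute register number / 16. */
static unsigned rbp_bank_of(unsigned abs_reg)
{
    assert(abs_reg < 256u);
    return abs_reg / 16u;
}

/* Build an RBP mask protecting every bank touched by first..last. */
static uint16_t rbp_mask_for_range(unsigned first, unsigned last)
{
    uint16_t mask = 0;
    unsigned bank;

    assert(first <= last && last < 256u);
    for (bank = rbp_bank_of(first); bank <= rbp_bank_of(last); ++bank)
        mask |= (uint16_t)(1u << bank);
    return mask;
}

/* Nonzero if a User mode access to abs_reg is permitted under rbp. */
static int rbp_user_access_ok(uint16_t rbp, unsigned abs_reg)
{
    return ((rbp >> rbp_bank_of(abs_reg)) & 1u) == 0u;
}
```

Under the stated assumption, protecting gr64-gr95 corresponds to a mask with two
adjacent bits set, one for each bank of 16 registers.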

On-Chip  Timer  Control 

Special registers sr8 and sr9, known as TMC (timer counter) and TMR (timer
reload value), support a 24-bit real-time clock. The TMC register decrements at the
rate of the processor clock. When it reaches zero it will generate an interrupt if
enabled. In conjunction with support software these two registers can be used to
implement many of the functions often supported by off-chip timer circuitry.
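
Since the counter is 24 bits wide and decrements once per processor clock, the
reload value for a periodic tick follows from simple division. The sketch below is
an illustration only: timer_reload_value is a hypothetical helper, the clock and
tick rates are parameters supplied by the caller, and any off-by-one in how the
hardware counts through zero is ignored.

```c
#include <stdint.h>

#define TIMER_MAX 0x00FFFFFFu   /* largest value a 24-bit counter can hold */

/*
 * Return the number of processor cycles per tick, or 0 if the requested
 * tick period cannot be represented in 24 bits (or tick_hz is 0).
 */
static uint32_t timer_reload_value(uint32_t cpu_hz, uint32_t tick_hz)
{
    uint32_t cycles;

    if (tick_hz == 0u)
        return 0u;
    cycles = cpu_hz / tick_hz;
    return (cycles >= 1u && cycles <= TIMER_MAX) ? cycles : 0u;
}
```

A 25 MHz clock and a 100 Hz tick give 250,000 cycles, which fits; a 1 Hz tick at the
same clock rate would need 25,000,000 cycles, which does not, so support software
would have to count several shorter timer periods instead.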

Program  Counter 

A 29K processor contains a Master and Slave PC (program counter) address
register. The Master PC register contains the address of the instruction currently
being fetched. The Slave contains the address of the next sequential instruction. Once
an instruction flows into the execution unit, unless interrupted, the following
instruction, currently in decode, will always flow into the execution unit. This is true
for all instructions except for instructions such as IRET. Even if the instruction in
execute is a jump-type, the following instruction, known as the delay-slot instruction,
is executed before the jump is taken. This is known as delayed branching and can be
very useful in hiding memory access latencies, as the processor pipeline can be kept
busy executing the delay-slot instruction while the new instruction sequence is fetched.
It is an important activity of high level language compilers to find useful instructions
to place in delay-slot locations.

The Master PC value flows along the PC-bus and the bus activity is recorded by
the PC buffer registers (see Figure 1-16). There are three buffer registers arranged in
sequence. These buffer registers are accessible within special register space as
sr10-sr12, better known as PC0, PC1 and PC2. The PC0 register contains the address
of the instruction currently in decode; register PC1 contains the address of the
instruction currently in execute; and PC2 the address of the instruction now in
write-back.

Figure 1-16. Am29000 Processor Program Counter


When a program exception occurs the PC-buffer registers become frozen. This
is signified by the FZ bit in the current processor status register being set. When
frozen, the PC-buffer registers accumulate no new PC-bus information. The frozen PC
information can be used later to restart program execution. An IRET instruction
causes the PC1 and PC0 register information to be copied to the Master and Slave PC
registers and instruction fetching to commence. For this reason it is important to
maintain both PC1 and PC0 values when dealing with such system level activities as
nested interrupt servicing. Since the PC2 register records the address of a now
executed instruction, maintenance of its value is less important; but it can play an
important role in debugging.
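
The behaviour of the three buffer registers can be pictured with a small software
model. This is a sketch only: the type and function names are hypothetical, the
frozen flag stands in for the FZ bit, and the pairing of PC1 with the Master PC and
PC0 with the Slave PC on IRET is an assumption about the order in which the two
values are used.

```c
#include <stdint.h>

/* Software model of the PC-buffer registers. */
typedef struct {
    uint32_t pc0;     /* address of the instruction in decode */
    uint32_t pc1;     /* address of the instruction in execute */
    uint32_t pc2;     /* address of the instruction in write-back */
    int      frozen;  /* models the FZ bit */
} pc_buffers;

/* Advance the model by one pipeline step, unless frozen. */
static void pc_buffers_clock(pc_buffers *b, uint32_t decode_addr)
{
    if (b->frozen)
        return;               /* no new PC-bus information is recorded */
    b->pc2 = b->pc1;
    b->pc1 = b->pc0;
    b->pc0 = decode_addr;
}

/* Values handed back to the Master and Slave PC on IRET (assumed order). */
typedef struct {
    uint32_t master;
    uint32_t slave;
} pc_restart;

static pc_restart pc_buffers_iret(const pc_buffers *b)
{
    pc_restart r;
    r.master = b->pc1;
    r.slave  = b->pc0;
    return r;
}
```

The model makes the reason for preserving PC0 and PC1 during nested interrupts
visible: once the buffers are unfrozen, further steps overwrite both values.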

When a CALL instruction is executed, the B-bus supplies the Master PC with
the address of the new instruction stream. Earlier, when the CALL instruction
entered the decode stage, the PC-bus was used to fetch the delay-slot instruction; and
the address of the instruction following the delay-slot (the return address) was
prepared for entry into the Slave PC. On the following cycle, the CALL instruction
enters the execute stage and the return address enters the Return Address register.
During CALL execution, the return address is transferred to the register file via the
R-BUS.

MMU Control

The last of the generally available special registers are concerned with memory
management unit (MMU) operation. Processors which have the Translation
Look-Aside Buffer (TLB) registers omitted will not have these two special registers.
The operation of the MMU is quite complex, and Chapter 6 is fully dedicated to the
description of its operation. Many computer professionals working in real-time
projects may be unfamiliar with MMU operation. The MMU enables virtual addresses
generated by the processor to be translated into physical memory addresses.
Additionally, memory is divided into page sized quantities which can be individually
protected against User mode or Supervisor mode read and write access.

Special register sr13, known as MMU, is used to select the page size; a
minimum of 1k bytes, and a maximum of 8k bytes. Also specified is the current User
mode process identifier. Each User mode process is given a unique identifier and
Supervisor mode processes are assumed to have identifier 0.
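
The effect of the page size selection on an address can be sketched as follows. The
encoding used here is an assumption made for illustration: a selector value of 0 to 3
is taken to choose 1k, 2k, 4k and 8k byte pages respectively, and the helper names
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Page size in bytes for an assumed selector value 0..3 (1k..8k). */
static uint32_t page_size_bytes(unsigned ps)
{
    assert(ps <= 3u);
    return 1024u << ps;
}

/* Virtual page number: the address bits above the page offset. */
static uint32_t virt_page_number(uint32_t vaddr, unsigned ps)
{
    assert(ps <= 3u);
    return vaddr >> (10u + ps);
}

/* Offset within the page: passed through translation unchanged. */
static uint32_t page_offset(uint32_t vaddr, unsigned ps)
{
    return vaddr & (page_size_bytes(ps) - 1u);
}
```

With 4k byte pages, for instance, virtual address 0x12345 splits into page number
0x12 and offset 0x345; only the page number takes part in translation.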

Certain newer 29K processors support two TLB systems on-chip. Each TLB
has an independently programmable page size. These processors, and their close
relatives, can be programmed for a maximum page size of 16M bytes.

Additional  Protected  Special  Registers 

Monitor  Mode 

Some newer members of the 29K family have additional Supervisor-only
accessible special registers which are addressed above sr14. Figure 1-17 shows the
additional special registers for processors which support Monitor mode. Special
register sr15, known as RSN (reason vector), records the trap number causing Monitor
mode to be entered. Monitor mode extends the software debugging capability of the
processor; it was briefly described in the previous section describing the processor
features, and is dealt with in detail in later chapters. The shadow Program Counter
registers constitute a second set of PC-buffer registers. They record the PC-bus
activity and are used to support Monitor mode debugging.

Figure 1-17. Additional Special Purpose Registers for the Monitor Mode Support


Am29050 

Figure 1-18 shows the additional special registers used by the Am29050
processor for region mapping. In the Am29050 case, the additional special registers
support two functions: debugging and region mapping. Four special registers in the
range sr16-sr19 extend the virtual address mapping capabilities of the TLB registers.
They support the mapping of two regions which are of programmable size. Their use
reduces the demand placed on TLB registers to supply all of a system's address
mapping and memory access protection requirements.

Figure 1-18. Additional Special Purpose Registers for the Am29050 Microprocessor


Instruction  and  Data  Breakpoints 

Figure 1-19 shows the additional special registers for processors which support
breakpoint debugging. They facilitate the control of separate instruction access
breakpoints and data access breakpoints. Some 29K processors have instruction
breakpoints only; others support both types of breakpoint.

Figure 1-19. Additional Special Purpose Registers for Breakpoint Control

On-Chip  Cache  Control 

Figure 1-20 shows the additional special registers required to access on-chip
cache. There are only two additional registers, sr29 and sr30, required. Both registers
are used for communicating with the instruction memory cache supported by many
29K processors. If a processor also contains data cache, the memory can similarly be
accessed via the same cache interface registers. Supervisor mode support code
controls cache operation via the processor configuration register (CFG), and is not
likely to make use of the cache interface registers. These registers may be used by
debuggers and monitors to preload and examine cache memory contents.

Figure 1-20. Additional Special Purpose Registers for On-Chip Cache Control


User  Mode  Accessible  Special  Registers 

Figure 1-15 showed the special register space with its two regions. The region
addressed from sr128 upward is always accessible; registers below sr128 are only
accessible to the processor when operating in Supervisor mode.

The original Am29000 processor defined a subset of User mode accessible
registers, in fact those shown in Figure 1-15. Every 29K processor supports the use of
these special registers, but only the Am29050 has the full complement implemented.

Registers in the range sr128-sr135 are always present. However, the three
registers sr160-sr162 are used to support floating-point and integer operations. Only
certain members of the 29K family directly support these operations in processor
hardware. Other 29K family members virtualize these three registers. When not
available, an attempt to access them causes a protection violation trap. The trap
handler identifies the attempted operation and redirects the access to shadow copies of
the missing registers. The accessor is unaware that the virtualization has occurred,
except for the delay in completing the requested operation. In practice, floating-point
supporting special registers are not frequently accessed, except in the case of
floating-point intensive systems, which tend to be constructed around an Am29050
processor.

Indirect Pointers

Special registers sr128-sr130, better known as IPA, IPB and IPC, are the
indirect pointers used to access the general purpose register file. For instructions
which make use of the three operand fields, RA, RB and RC, to address general purpose
registers, the indirect pointer can be used as an alternative operand address source.
For example, the RA operand field supplies the register number for the source
operand A; if global register address gr0 is used in the RA instruction field, then the
operand register number is provided by the IPA register.

The  IPA,  IPB  and  IPC  registers  are  pointers  into  the  global  register  file.  They  are 
generally  used  to  point  to  parameters  passed  to  User  mode  helper  routines.  They  are 
also  used  to  support  instruction  emulation,  where  trap  handler  routines  perform  in 
software  the  missing  instruction.  The  operands  for  the  emulated  instruction  are 
passed  to  the  trap  handler  via  the  indirect  pointers. 
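
The selection between a direct register number and an indirect pointer can be
sketched in C. This is a simplified model: the names are hypothetical, each indirect
pointer is represented as a plain register number, the pairing of RB with IPB and RC
with IPC is assumed by analogy with the RA and IPA case given in the text, and any
other translation of register numbers is left out.

```c
#include <stdint.h>

/* Simplified model: each indirect pointer holds a register number. */
typedef struct {
    uint8_t ipa;
    uint8_t ipb;
    uint8_t ipc;
} indirect_ptrs;

typedef enum { FIELD_RA, FIELD_RB, FIELD_RC } operand_field;

/*
 * Return the register number actually used for an operand field.
 * A field value of 0 (gr0) selects the matching indirect pointer;
 * any other value is used directly.
 */
static unsigned resolve_register(const indirect_ptrs *ip,
                                 operand_field which,
                                 unsigned field_value)
{
    if (field_value != 0u)
        return field_value;
    switch (which) {
    case FIELD_RA: return ip->ipa;
    case FIELD_RB: return ip->ipb;
    default:       return ip->ipc;
    }
}
```

A trap handler emulating a missing instruction could be passed its operands this
way: the indirect pointers are loaded with the operand register numbers, and the
handler's own instructions name gr0 in their operand fields.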

ALU  Support 

Special registers sr131-sr134 support arithmetic unit operation. Register
sr131, better known as Q, is used during floating-point and integer multiply and
divide steps. Only the Am29050 processor can perform floating-point operations
directly, that is, without coprocessor or software emulation help. It is also the only
processor which directly supports integer multiply. All other current members of the
29K family perform these operations in a sequence of steps which make use of the Q
register.

The result of a comparison instruction is placed in a general purpose register, as
well as in the condition field of the ALU status register (special register sr132).
However, the ALU status register is not conveniently tested by such instructions as
conditional branch. Branch decisions are made on the basis of True or False values held
in general purpose registers. This makes a lot of sense, as contention for use of a single
resource such as the ALU status register would lead to a resource conflict which
would likely result in unwanted pipeline stalling.

The ALU status register controls and reports the operation of the processor
integer operation unit. It is divided into a number of specialized fields which, in some
cases, can be more conveniently accessed via special registers sr133 and sr134. The
shorthand access provided by these additional registers avoids the read, shift and
mask operations normally required before writing to bit-fields in the ALU status
register.

Data  Access  Channel 

The three channel control registers, CHA, CHD and CHC, were previously
described in the protected special registers section. However, User mode programs
have a need to establish load- and store-multiple operations which are controlled by
the channel support registers. Special register sr135, known as CR, provides a means
for a User mode program to set the Count Remaining field of the protected CHC
register. This field specifies the number of consecutive words transferred by the
multiple data move operation. Should the operation be interrupted for any reason, the
CR field reports the number of transfers yet to be completed. Channel operation is
typically restarted (if enabled) when an IRET type instruction is issued.

Instruction  Environment  Registers 

Special registers sr160 and sr162, known as FPE and FPS, are the
floating-point environment and status registers. The environment register is used by
User mode programs to establish the required floating-point operations, such as
double- or single-precision, IEEE specification conformance, and exception trap
enabling. The status register reports the outcome of floating-point operations. It is
typically examined as a result of a floating-point operation exception occurring. Only
processors (Am29050) which support floating-point operations directly (free of
trapware) have real sr160 and sr162 registers. All other processors appear to have these
registers via trapware support which creates virtual registers.

The integer environment is established by setting special register sr161, known
as INTE. There are two control bits which separately enable integer and
multiplication overflow exceptions. If exception detection is enabled, the processor will
take an Out-of-Range trap when an overflow occurs. Only processors (Am29040,
Am29240 and Am29243) which support integer multiply directly (free of trapware)
have a real sr161 register. All other processors appear to have an sr161 register via
trapware support.

Additional  User  Mode  Special  Registers 
Am29050 

The Am29050 has an additional special register, shown in Figure 1-21. Register
sr164, known as EXOP, reports the instruction operation code causing a trap. It is
used by floating-point instruction exceptions. Unlike other 29K processors the
Am29050 directly executes all floating-point instructions. Exception traps can
occur during these operations. When instruction emulation techniques are being used,
it is an easy matter to determine the instruction being emulated at the time of the trap.
However, with direct execution things are not as simple. The processor could
examine the memory at the address indicated by the PC-buffer registers to determine
the relevant instruction opcode. But the Am29050 supports a Harvard memory
architecture and there is no path within the processor to access the instruction memory
as if it were data. The EXOP register solves this problem. Whenever an exception trap is
taken, the EXOP register reports the opcode of the instruction causing the exception.

Users of other 3-bus Harvard type processors such as the Am29000 and
Am29005 should take note: virtualizing the unprotected special registers sr160-sr162
requires that the instruction space be readable by the processor (virtualizing, in this
case, means making registers sr160-sr162 appear to be accessible even when they are
not physically present). This can only be achieved by connecting the instruction and
data busses together (disabling the Harvard architecture advantages by creating a
2-bus system) or providing an off-chip bridge. This bridge must enable the instruction
address space to be reached from within some range of data memory space, at least for
word-size read accesses, albeit with additional access time penalties.

Special Purpose
Reg. No.          Unprotected Registers     Mnemonic

164               Exception Opcode          EXOP

Figure 1-21. Additional Special Purpose Register for the Am29050 Microprocessor

The Am29050 processor has an additional group of registers known as the
floating-point accumulators. There are four 64-bit accumulators ACC3-0 which can be
used with certain floating-point operations. They can hold double- or
single-precision numbers. They are not special registers, in that they do not lie in
special register space. They are located in their own register space, giving the Am29050
one more register space than the normal three register spaces of the other 29K family
members. However, like special registers, they can only be accessed by move-to and
move-from accumulator type instructions.

Double-precision  numbers  (64-bit)  can  be  moved  between  accumulators  and 
general  registers  in  a  single  cycle.  Global  registers  are  used  in  pairs  for  this  operation. 
This  is  possible  because  the  Am29050  processor  is  equipped  with  an  additional 
64-bit  write-back  port  for  floating  point  data,  and  the  register  file  is  implemented 
with  a  width  of  64-bits. 

1.10.3  Translation  Look-Aside  Registers 

Although some 29K family members are equipped with region mapping
registers, a Translation Look-Aside Buffer (TLB) technique is generally used to provide
virtual to physical address translation. The TLB is two-way set associative and up to
64 translations are cached in the TLB support registers.

The TLB registers form the basis for implementing a Memory Management
Unit. The scheme for reloading TLB registers is not dictated by processor microcode,
but left to the programmer to organize. This enables a number of performance
boosting schemes to be implemented with low overhead costs. However, it does place
the burden of creating a TLB maintenance scheme on the user. Those used to having to
work around a processor’s microcode imposed scheme will appreciate the freedom.

TLB registers can only be accessed by move-to TLB and move-from TLB
instructions executed by the processor operating in Supervisor mode. Each of the
possible 64 translation entries (fewer than 64 with some 29K family members) requires
a pair of TLB registers to fully describe the address translation and access
permissions for the mapped page. Pages are programmable in size from 1k bytes to 8k
bytes (to 16M bytes with newer 29K processors), and separate read, write and execute
permissions can be enabled for User mode and Supervisor mode access to the mapped
page.

There is only a single 32-bit virtual address space supported. This space is
mapped to real instruction, data or I/O memory. Address translation is performed in a
single cycle which is overlapped with other processor operations. This results in the
use of an MMU not imposing any run-time performance penalties, except where
TLB misses occur and the TLB cache has to be refilled. Each TLB entry is tagged
with a per-process identifier, avoiding the need to flush TLB contents when a
user-task context switch occurs. Chapter 6 fully describes the operation of the TLB.
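
A two-way set associative lookup with per-process tagging can be modelled in a few
lines of C. This is a software sketch under stated assumptions, not a description of
the hardware: the structure layout and names are hypothetical, the set index is
assumed to be taken from the low bits of the virtual page number, and permission
checking is left out.

```c
#include <stdint.h>

#define TLB_SETS 32   /* 32 sets x 2 ways = 64 translations */
#define TLB_WAYS 2

typedef struct {
    int      valid;
    uint32_t vtag;    /* virtual page number bits above the set index */
    uint8_t  pid;     /* process identifier; 0 for Supervisor mode */
    uint32_t frame;   /* physical page frame number */
} tlb_entry;

typedef struct {
    tlb_entry e[TLB_SETS][TLB_WAYS];
} tlb;

/*
 * Look up vaddr for process pid. Returns 1 and stores the physical
 * address in *paddr on a hit; returns 0 on a miss, in which case
 * software would have to refill an entry.
 */
static int tlb_lookup(const tlb *t, uint32_t vaddr, uint8_t pid,
                      unsigned page_shift, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> page_shift;
    uint32_t set = vpn % TLB_SETS;       /* assumed index selection */
    uint32_t tag = vpn / TLB_SETS;
    int way;

    for (way = 0; way < TLB_WAYS; ++way) {
        const tlb_entry *e = &t->e[set][way];
        if (e->valid && e->vtag == tag && e->pid == pid) {
            *paddr = (e->frame << page_shift)
                   | (vaddr & ((1u << page_shift) - 1u));
            return 1;
        }
    }
    return 0;
}
```

Because the process identifier takes part in the comparison, entries belonging to
different tasks can coexist in the model, which is the property that makes flushing
unnecessary on a context switch.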

1.10.4  External  Address  Space 

The  3-bus  members  of  the  29K  family  support  five  external  32-bit  address 
spaces.  They  are: 

■  Data  Memory  —  accessed  via  the  data  bus. 

■  Input/Output  —  also  accessed  via  the  data  bus. 

■  Instruction  —  accessed  via  the  instruction  bus,  normally  read-only. 

■  ROM  —  also  accessed  via  the  instruction  bus,  normally  read-only. 

■  Coprocessor  —  accessed  via  both  data  and  address  busses.  Note,  the  address 
bus  is  only  used  for  stores  to  coprocessor  space.  This  enables  64-bit  transfers 
during  stores  and  32-bit  during  loads. 

The address bus is used for address information when accessing all address
spaces except the coprocessor space. During load and store operations to coprocessor
space, address information can be supplied in a limited way by the OPT2-0 field of
the load and store instructions. Of course, with off-chip address decoding support,
access to coprocessor space could always be made available via a region of I/O or
data space. Coprocessors support off-chip extensions to a processor’s execution
unit(s). AMD supplied a coprocessor in the past, the Am29027, which was for
floating-point support. It is possible that users could construct their own coprocessor
for some specialized support task.

Earlier sections discussed the read-only nature of the instruction bus of 3-bus
processors. Instructions are fetched along the instruction bus from either the ROM
space or the Instruction space. Access to the two 32-bit spaces is distinguished by the
IREQT processor pin. The state of this pin is determined by the RE (ROM enable) bit
of the current processor status register (CPS). This bit can be set by software or via
programmed event actions, such as trap processing. ROM space is intended for
system level support code. Typically systems do not decode this pin and the two spaces
are combined into one.

The Input/Output (I/O) space can be reached by setting the AS (address space)
bit in load and store instructions. Transfers to I/O space, like coprocessor space and
data space transfers, are indicated by the appropriate value appearing on the
DREQT1-0 (data request type) processor pins. I/O space access is only convenient
for assembly level routines. There is typically no convenient way for a high level
language to indicate an access is to be performed to I/O space rather than data space.
For this reason use of I/O space is often best avoided, unless it is restricted to accessing
some Supervisor maintained peripheral which is best handled via assembly language
code.

The 2-bus 29K family processors support a reduced number of off-chip address
spaces, in fact, only two: Input/Output space, and a combined Instruction/Data
memory space. Accessing both instructions and data via a shared instruction/data bus
simplifies the memory system design. It can also simplify the software; for example,
instruction space and data space can no longer overlap. Consider a 3-bus system
which has physical memory located at address 0x10000 in instruction space and also
different memory located at address 0x10000 in data space. Software errors can
occur regarding accessing the correct memory for address 0x10000. It can also
complicate system tasks such as virtual memory management, where separate free-page
lists would have to be kept for the different types of memory.

The Translation Look-Aside buffer (TLB), used to support virtual memory
addressing, supports separate enabling of data and instruction access via the R/W/X
(read/write/execute) enable bits. However, permission checking is only performed
after address translation is performed. It is not possible to have two valid
virtual-to-physical address translations present in the TLB at the same time for the same
virtual address, even if one physical address is for data space and the other instruction
space. This complicates accessing overlapping address spaces via a single 32-bit
virtual space.

Accessing virtual memory has similar characteristics to accessing memory via a
high level language. For example, C normally supports a single address space. It is
difficult and nonportable to have C code which can reach different address spaces.
Except for instruction fetching, all off-chip memory accesses are via load and store
type instructions. The OPT2-0 field for these instructions specifies the size of the
data being transferred: byte, half-word or 32-bit. The compiler assigns OPT field
values for all load and store instructions it generates. Unless via C language
extensions or assembly code post-processing, there is no way to set the load and store
instruction address-space-selecting options. Software is simplified by locating all
external peripherals and memory in a single address space; or when a Harvard
architecture is used, by not overlapping the regions of data and instruction memory
spaces used.

1.11 INSTRUCTION FORMAT


All instructions for the Am29000 processor are 32 bits in length, and are divided
into four fields, as shown in Figure 1-22. These fields have several alternative
definitions, as discussed below. In certain instructions, one or more fields are not used,
and are reserved for future use. Even though they have no effect on processor operation,
bits in reserved fields should be 0 to ensure compatibility with future processor
versions.

Figure 1-22. Instruction Format

The instruction fields are defined as follows:


BITS 31-24

Op
This field contains an operation code, defining the operation to be
performed. In some instructions, the least-significant bit of the
operation code selects between two possible operands. For this reason,
the least-significant bit is sometimes labeled “A” or “M”, with the
following interpretations:

A (Absolute): The A bit is used to differentiate between
Program-Counter relative (A = 0) and absolute (A = 1) instruction addresses,
when these addresses appear within instructions.

M (IMmediate): The M bit selects between a register operand (M = 0)
and an immediate operand (M = 1), when the alternative is allowed by
an instruction.

BITS 23-16

RC
The RC field contains a global or local register-number, which is the
destination operand for many instructions.

I17..I10
This field contains the most-significant 8 bits of a 16-bit instruction
address. This is a word address, and may be Program-Counter
relative or absolute, depending on the A bit of the operation code.

I15..I8
This field contains the most-significant 8 bits of a 16-bit instruction
constant.

VN
This field contains an 8-bit trap vector number.

CE//CNTL
This field controls a load or store access.

BITS 15-8

RA
The RA field contains a global or local register-number, which is a
source operand for many instructions.

SA
The SA field contains a special-purpose register-number.

BITS 7-0

RB
The RB field contains a global or local register-number, which is a
source operand for many instructions.

RB or I
This field contains either a global or local register-number, or an
8-bit instruction constant, depending on the value of the M bit of the
operation code.

I9..I2
This field contains the least-significant 8 bits of a 16-bit instruction
address. This is a word address, and may be Program-Counter
relative, or absolute, depending on the A bit of the operation code.

I7..I0
This field contains the least-significant 8 bits of a 16-bit instruction
constant.

UI//RND//FD//FS
This field controls the operation of the CONVERT instruction.

The fields described above may appear in many combinations. However, certain
combinations which appear frequently are shown in Figure 1-23.
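
The field boundaries described above can be illustrated with a short sketch. The Python function below is not part of any AMD tool; the function name and the dictionary keys are chosen here purely for illustration. It splits a 32-bit instruction word into its four 8-bit fields and also forms the 16-bit constant and 16-bit word address that result when bits 23-16 and bits 7-0 are combined. Sign extension of Program-Counter relative addresses is deliberately left out.

```python
def decode_fields(word):
    """Split a 32-bit 29K instruction word into its four 8-bit fields.

    Illustrative only: the key names are hypothetical, and which
    interpretation of each field applies depends on the instruction.
    """
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("instruction word must fit in 32 bits")

    op = (word >> 24) & 0xFF        # bits 31-24: operation code
    bits_23_16 = (word >> 16) & 0xFF  # RC, I17..I10, I15..I8, VN or CE||CNTL
    bits_15_8 = (word >> 8) & 0xFF    # RA or SA
    bits_7_0 = word & 0xFF            # RB, I, I9..I2, I7..I0 or CONVERT control

    combined = (bits_23_16 << 8) | bits_7_0  # 16 bits built from two fields

    return {
        "op": op,
        "a_or_m_bit": op & 1,       # least-significant bit of the opcode
        "bits_23_16": bits_23_16,
        "bits_15_8": bits_15_8,
        "bits_7_0": bits_7_0,
        "constant16": combined,      # I15..I0 when used as a constant
        "word_address": combined,    # I17..I2 when used as an address
        "byte_offset": combined << 2,  # word address scaled to bytes
    }


fields = decode_fields(0x12345678)   # arbitrary bit pattern, not a real opcode
print(hex(fields["op"]), hex(fields["constant16"]))
```

Because the 16-bit address is a word address, the sketch shifts it left by two to obtain a byte offset; whether that offset is absolute or relative to the Program Counter is decided by the A bit.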

1.12  KEEPING  THE  RISC  PIPELINE  BUSY 

If the external interface of a microprocessor cannot support an instruction fetch
rate of one instruction per cycle, an execution rate of one instruction per cycle
cannot be sustained. As described in detail in Chapter 6, a 4-1 DRAM (4-cycle first
access, 1-cycle subsequent burst-mode access) memory system used with a 3-bus Am29000
processor can sustain an average processing time per instruction of typically two
cycles, not the desired one cycle per instruction. However, a 2-1 SRAM based system
comes very close to this target. From these example systems it can be seen that even
if a memory system can support 1-cycle burst-mode access, there are other factors
which prevent the processor from sustaining single-cycle execution rates.

It is important to keep the processor pipeline busy doing useful work. Pipeline
stalling is a major source of lost processor performance. Stalling occurs as a result
of: inadequate memory bandwidth, high memory access latency, bus access contention,
excessive program branching, and instruction dependencies. To get the best from a
processor, an understanding of instruction stream dependencies is required.
Processors in the 29K family all have pipeline interlocks supported by processor
hardware. The programmer does not have to ensure correct pipeline operation, as the
processor will take care of any dependencies. However, it is best that the programmer
arranges code execution to smooth the pipeline operation.

[Figure 1-23: instruction field layouts, each drawn with bit boundaries 31, 23, 15, 7
and 0, for: three operands with possible 8-bit constant; three operands without
constant; one register operand with 16-bit constant; jumps and calls with 16-bit
instruction address; two operands with trap vector number; loads and stores.]

1.13  PIPELINE  DEPENDENCIES 

Modification of some registers has a delayed effect on processor behavior.
When developing assembly code, care must be taken to prevent unexpected behavior.
The easiest of the delayed effects to remember is the one cycle that must separate
the setting of an indirect pointer from its use. This occurs most often with the
register stack pointer, gr1. It cannot be used to access a local register in the
instruction that follows the instruction that writes to gr1. An instruction that does
not require gr1 (and requiring gr1 includes any reference to a local register, since
these are addressed via gr1) can be placed immediately after the instruction that
updates gr1.
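
The register stack pointer rule lends itself to a mechanical check. The sketch below is only an illustration: it assumes a hypothetical representation in which each instruction is a pair of a destination register name and a tuple of source register names, and it reports the positions of instructions that touch a local register directly after a write to gr1. It is not a description of how any 29K assembler or tool works.

```python
def gr1_hazards(instructions):
    """Return indexes of instructions that use a local register
    immediately after an instruction that writes gr1.

    Each instruction is (dest, sources), for example ("lr2", ("gr96", "lr3")).
    This representation is an assumption made for the example.
    """
    hazards = []
    for i in range(1, len(instructions)):
        prev_dest, _ = instructions[i - 1]
        dest, sources = instructions[i]
        registers = [dest, *sources]
        # Local registers are addressed via gr1, so any "lr" reference counts.
        uses_local = any(r is not None and r.startswith("lr") for r in registers)
        if prev_dest == "gr1" and uses_local:
            hazards.append(i)
    return hazards


sequence = [
    ("gr1", ("gr1",)),    # update the register stack pointer
    ("lr0", ("gr96",)),   # local register used one instruction too early
    ("gr97", ("lr1",)),   # fine: one instruction now separates it from the write
]
print(gr1_hazards(sequence))
```

Placing any instruction that does not reference a local register between the first two entries removes the reported hazard.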

Direct modification of the Current Processor Status (CPS) register must also be
done carefully, particularly where the Freeze (FZ) bit is reset. When the processor
is frozen, the special-purpose registers are not updated during instruction
execution. This means that the PC1 register does not reflect the actual program
counter value at the current execution address, but rather at the point where freeze
mode was entered. When the processor is unfrozen, either by an interrupt return or
direct modification of the CPS, two cycles are required before the PC1 register
reflects the new execution address. Unless the CPS register is being modified
directly, this creates no problem.

Consider the following examples. If the FZ bit is reset and trace enable (TE) is
set at the same time, the next instruction should cause a trace trap, but the
PC-buffer registers frozen by the trap will not have had time to catch up with the
current execution address. Within the trap code the processor will appear to have
stopped at some random address, held in PC1. If interrupts and traps are enabled at
the same time as the FZ bit is cleared, then the next instruction may suffer an
external interrupt or an illegal instruction trap. Once again, the PC-buffer
registers will not reflect the true execution address. An interrupt return would
cause execution to commence at a random address. The above problems can be avoided by
resetting FZ two cycles before enabling the processor to once again enter freeze
mode.

Instruction  Memory  Latency 

The Branch Target Cache (BTC), or the Instruction Memory Cache, can be used
to remove the pipeline stalling that normally occurs when the processor executes a
branch instruction. To illustrate memory access latency, the effects of the BTC are
examined here. The address of a branch target appears on the address pins at the
start of the write-back stage. Figure 1-24 shows the instruction flow through the
pipeline stages, assuming the external instruction memory returns the target of a
jump during the same cycle in which it was requested. This makes the Target
instruction available at the fetch stage while the Delay instruction has to be
stalled before it can enter the execute stage. In this case, execution is stalled for
two cycles when the BTC is not used to supply the target instruction.

Figure 1-24. Pipeline Stages for BTC Miss


The address of the fetch is presented to the BTC hardware during the execute
stage of the jump instruction, the same time the address is presented to the memory
management unit. When a hit occurs, the target instruction is presented to the decode
stage at the next cycle. This means no pipeline stalling occurs. The external
instruction memory has up to three cycles to return the instruction four words past
the target address. That is, if single-cycle burst-mode can be established in three
cycles (four cycles for the Am29050 processor) or less, then continuous execution can
be achieved. The BTC supplies the target instruction and the following three
instructions, assuming another jump is not taken. Figure 1-25 shows the flow of
instructions through the pipeline stages.

Figure 1-25. Pipeline Stages for a BTC Hit

Data Dependencies

Instructions that require the result of a load should not be placed immediately
after the load instruction. The Am29000 processor can overlap load instructions with
other instructions that do not depend on the result of the load. If 4-cycle data
memory is in use, then external data loads should (if possible) have four
instructions (4-cycles) between the load instruction and the first use of the data.
Instructions that depend on data whose loads have not yet completed cause a pipeline
stall. The stall is minimized by forwarding the data to the execution unit as soon as
it is available.

Consider the example of an instruction sequence shown in Figure 1-26. The
instruction at Load+1 is dependent on the data loaded at Load. The address of load
data appears on the address pins at the start of the write-back stage. At this point,
instruction Load+1 has reached the execution stage and is stalled until the data is
forwarded at the start of the next cycle, assuming the external data memory can
return data within one cycle.

If the instruction were not dependent on the result of the load, it would have
executed without delay. Because of data forwarding and a 1-cycle data memory, the
load data would be available for instruction Load+2 without causing a pipeline stall.
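
The scheduling guidance above can be reduced to a rough rule of thumb, sketched below in Python. This is a simplification made for illustration rather than a model of the real pipeline: it assumes the stall is simply the data memory access time minus the number of independent instructions placed between the load and the first use of its data. The function and parameter names are hypothetical.

```python
def load_use_stall(memory_cycles, independent_between):
    """Estimate stall cycles for the first instruction that uses load data.

    memory_cycles       -- cycles the data memory needs to return the data
    independent_between -- instructions between the load and the first use
                           that do not depend on the loaded value

    Assumes a simple linear relationship; real behavior also depends on
    bus contention and other effects not modeled here.
    """
    if memory_cycles < 1 or independent_between < 0:
        raise ValueError("expected memory_cycles >= 1 and a non-negative gap")
    return max(0, memory_cycles - independent_between)


# 1-cycle memory: a dependent Load+1 stalls, a dependent Load+2 does not.
print(load_use_stall(1, 0), load_use_stall(1, 1))
# 4-cycle memory: four independent instructions hide the latency.
print(load_use_stall(4, 4))
```

Under this reading, moving independent instructions into the gap after a load is the programmer's main tool for hiding data memory latency.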


1.14  ARCHITECTURAL  SIMULATION,  sim29 

AMD has for a long time made available a 29K simulator which accurately
models the processor operation. This simulator, known as the Architectural Simulator,
can be configured to incorporate memory system characteristics. Since memory
system performance can greatly influence overall system performance, the use of the
simulator before making design decisions is highly recommended.

Simulation of all the 29K family members is supported, making the simulator
useful in determining processor choice [AMD 1991c][AMD 1993c]. For example,
does a floating-point intensive application require an Am29050 or will an Am29000
suffice? Alternatively, the performance penalties of connecting the data and
instruction busses together on a 3-bus Harvard Architecture processor can be
determined.

Because the simulator models detailed processor operation, such as pipeline
stages, cache memory, instruction prefetch, channel operation and much more, the
simulation run-times are longer than if the Instruction Set Simulator (ISS) were used.
Consequently, the Architectural Simulator is seldom used for program debugging.
The ISS is described in Chapter 7 (Software Debugging). Its infrequent use for
debugging is one of the reasons that the Architectural Simulator does not utilize the
Universal Debugger Interface (see section 7.5). Without a UDI interface, the
simulator cannot support interactive debugging. Simulation results are directed to a
log file. Interpreting their meaning and dealing with the log file format takes a
little practice; more on this later.

When used with a HIF conforming operating system, the standard input and output
for the simulated program use the standard input and output for the executable
simulator. Additionally, the 29K program standard output is also written to the
simulation log file. AMD does not supply the simulator in source form; it is available
in binary for UNIX type hosts and 386 based PCs. The simulator driver, sim29,
supports several command line options, as shown below. AMD updated the simulator
after version 1.1-8; the new version is compatible with the old and simulates at more
than four times the speed. The old simulator is still used with the Am29000 and
Am29050 processors. Only the new simulator models the Am2924x microcontrollers
and newer 2-bus processors. The following description of command line options
covers both simulator versions.

sim29 [-29000 | -29005 | -29030 | -29035 | -29050 ... -29240]
      [-cfg=xx] [-d] [-e eventfile] [-f freq] [-h heapsize] [-L] [-n]
      [-o outputfile] [-p from-to] [-r osboot] [-t max_sys_calls]
      [-u] [-v] [-x[codes]] [-dcacheoff] [-icacheoff] [-dynmem <val>]
      execfile [... optional args for executable]

OPTIONS

-29000|29005|29030|29035|29040|29050|29200|29205|29240|...
    Select 29K processor, default is Am29000. Depending on the processor
    selected, the old or new simulator is selected.

-cfg=xx
    Normally the simulator starts execution at address 0, with the processor
    Configuration Register (CFG) set to the hardware default value. It is the
    responsibility of the application code or the osboot code to modify the
    CFG register as necessary. Alternatively, the CFG register can be
    initialized from the command line. The -cfg option specifies the setting
    for CFG, where xx is a 1 to 5 digit HEX number. If the -cfg option is
    used, no run-time change to CFG will take effect, unless an Am292xx
    processor is in use. The -cfg option is seldom used; it should be used
    where an osboot file is not supplied with the -r option. Alternatively it
    can be used to override the cache enable/disable operation of osboot
    code. This can enable the effects of cache to be determined without the
    need to build a new osboot file. The -cfg option is not supported by the
    newer simulator. Caches can be disabled using the new -icacheoff and
    -dcacheoff options.

-d
    This option instructs the simulator to report the contents of processor
    registers in the logfile at end of simulation.

-dcacheoff
    This option is only available with the newer simulator. When used it
    causes the Configuration Register (CFG) to be set for data cache disable.

-dynmem <val>
    During execution a program may access a memory region outside any loaded
    memory segment or heap and stack region. The simulator can be instructed
    to automatically allocate (val=1) memory for the accessed region.
    Alternatively (default, val=0) an access violation is reported.

-e eventfile
    An event file is almost always used. It enables memory system
    characteristics to be defined and the simulation to be controlled (see
    section 1.14.1).

-f frequency
    Specify CPU frequency in MHz; the default for the Am292xx and Am29035 is
    16 MHz; the Am2900x default is 25 MHz; and the default frequency for the
    Am29030 and Am29050 is 40 MHz.

-h heapsize
    This option specifies the amount of resource memory available to the
    simulated 29K system. This memory is used for the register stack and
    memory stack support as well as the run-time heap. The default size is
    32 K bytes, which corresponds to a heapsize value of 32.

-icacheoff
    This option is only available with the newer simulator. When used it
    causes the Configuration Register (CFG) to be set for instruction cache
    disable.

-L
    This option is similar in nature to the -cfg option. It can be used to
    select the large memory model for the Am292xx memory banks. Normally this
    selection is performed in the osboot file. However, the -L option can be
    used to override the osboot settings, without having to build a new
    osboot file. This option is currently not supported in the newer
    simulator.

-n
    Normally the simulator will allow access to the two words following the
    end of a data section, without generating an access violation. Some of
    the support library routines, such as strcpy(), used by 29K application
    code, use a read-ahead technique to improve performance. If the
    read-ahead technique is not supported, then the -n option should be used.
    Only the older simulator supports this option. The newer simulator always
    allows access to the words just past the end of the data section.

-o outputfile
    The simulator normally presents simulation results in file sim.out.
    However, an alternative result file can be selected with this option.

-p from-to
    The simulator normally produces results of a general nature, such as
    average number of instructions per second. It is possible, using this
    option, to examine the operation of specific code sequences within the
    address range given by from and to.

-r osboot
    The simulator can load two 29K executable programs via command-line
    direction: osboot and program. It is normal to load an operating system
    to deal with application support services; this is accomplished with
    osboot. It is sometimes referred to as the romfile, because when used
    with 29K family members which support separate ROM and Instruction
    spaces, osboot is loaded into ROM space. AMD supplies a HIF conforming
    operating system called OS-boot which is generally used with the -r
    option. Your simulation tool installation should have a 29K executable
    file called osboot, romboot or even pumaboot which contains the OS-boot
    code. Care should be taken to identify and use the correct file. The
    newer simulator will automatically select a default osboot file from the
    library directory if the -r option is not used.

-t max_sys_calls
    Specify maximum number of system call types that will be used during
    simulation. This switch controls the internal management of the
    simulator; it is seldom used and has a default value of 256. This option
    is not supported by the newer simulator.

-u
    The Am292xx microcontroller family members have built-in ROM and DRAM
    memory controllers. Programmable registers are used to configure the ROM
    and DRAM region controllers. If the -u option is used, application code
    in file program can modify the controller settings, otherwise only code
    in osboot can effect changes. This prevents application code from
    accidentally changing the memory region configuration.

-v
    The OS-boot operating system, normally used to implement the osboot file,
    can modify its warm-start operation depending on the value in register
    gr104 (see section 7.4). The -v switch causes gr104 to be initialized to
    0. When OS-boot is configured to operate with or without MMU support, a
    run-time gr104 value of 0 will turn off MMU use.

-x[codes]
    If a 29K error condition occurs during simulation, execution is not
    stopped. The -x option can be used to cause execution to stop under a
    selected range of error conditions. Note, the option is not supported by
    the newer simulator. Each error condition is given a code letter. If -x
    is used with no selected codes, then all the available codes are assumed
    active. Supported codes are:

    a   Address error; data or instruction address out of bounds.
    k   Kernel error; illegal operation in Supervisor mode.
    o   Illegal opcode encountered.
    f   Floating-point exception occurred, such as divide by zero.
    p   A protection violation occurred in User mode.
    s   An event file error detected.

execfile
    Name of the executable program to be loaded into memory; followed by any
    command-line arguments for the 29K executable. It is important that the
    program be correctly linked for the intended memory system. This is
    particularly true for systems based on Am292xx processors. They have ROM
    and DRAM regions which can have very different memory access performance.
    If SRAM devices are to be used in the ROM region, it is important that
    the application be linked for the ROM region use rather than the DRAM.

It is best to run sim29 with the -r osboot option (this is the default operation with
the newer simulator). This is sometimes called cold-start operation. The osboot
program must perform processor initialization, bringing the processor into what is
known as the warm-start condition. At this point, execution of the loaded program
commences. It is possible to run the older simulator without the use of an osboot
file; this is known as warm-start simulation. When this is done the simulator
initializes the processor special registers CFG and CPS to a predefined warm-start
condition. AMD documentation explains the chosen settings; they are different for
each processor. Basically, the processor is prepared to run in User mode with traps
and interrupts enabled and cache in use.
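
For repeated simulation runs it can be convenient to assemble the sim29 command line in a script. The sketch below is one way this might be done in Python; it uses only options listed above (-e, -o, -r and the processor selector), and the function name and all file names are hypothetical. It only builds the argument list and does not attempt to run the simulator.

```python
def build_sim29_command(processor, execfile, osboot=None, eventfile=None,
                        outputfile=None, exec_args=()):
    """Build an argument list for a cold-start or warm-start sim29 run.

    processor  -- family member as a string, for example "29000"
    execfile   -- 29K executable to simulate
    osboot     -- osboot file for the -r option (None omits the option)
    eventfile  -- event file for the -e option
    outputfile -- results file for the -o option (sim.out if omitted)
    exec_args  -- arguments passed on to the 29K executable
    """
    command = ["sim29", "-" + processor]
    if eventfile is not None:
        command += ["-e", eventfile]
    if outputfile is not None:
        command += ["-o", outputfile]
    if osboot is not None:
        command += ["-r", osboot]
    command.append(execfile)      # the executable comes after all options
    command.extend(exec_args)     # followed by its own arguments
    return command


# Hypothetical file names; a cold-start run using an osboot file.
print(" ".join(build_sim29_command("29000", "program",
                                   osboot="osboot", eventfile="mem.evt")))
```

Returning a list rather than a single string keeps each argument separate, which suits most process-launching interfaces.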

To support osboot operation, the simulator prepares processor registers before
osboot operation starts (see Figure 1-27).

Figure 1-27. Register Initialization Performed by sim29

The initial register information is extracted from the program file. Via the
register data, the osboot code obtains the start address of the program code. If
osboot code is not used (no -r command-line switch when using the older simulator),
the 29K Program Counter is initialized to the start address of program code, rather
than address 0. To support direct entry into warm-start code, the program argument
information is duplicated in lr2 and lr3. Normally this information is obtained by
osboot using the data structure pointed to by gr103.

The simulator intercepts a number of HIF service calls (see section 2.2). These
services mainly relate to operating system functions which are not simulated, but
dealt with directly by the simulator. All HIF services with identification numbers
below 256 are intercepted. Additionally, service 305, for querying the CPU frequency,
is intercepted. Operating system services which are not intercepted must be dealt
with by the osboot code. The simulator will intercept a number of traps if the
-x[codes] command line option is used; otherwise all traps are directed to osboot
support code, or any other trapware installed during 29K run-time.
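
The interception rule for HIF services is simple enough to state as a predicate. The following Python sketch restates it; the function name is invented for illustration and does not correspond to anything in the simulator itself.

```python
CPU_FREQUENCY_SERVICE = 305  # HIF service for querying the CPU frequency


def simulator_intercepts(service_number):
    """Return True if sim29 handles this HIF service itself.

    Encodes the rule given in the text: services numbered below 256,
    plus service 305, are intercepted; all others fall to osboot code.
    """
    if service_number < 0:
        raise ValueError("service numbers are non-negative")
    return service_number < 256 or service_number == CPU_FREQUENCY_SERVICE


for number in (1, 255, 256, 305):
    handler = "simulator" if simulator_intercepts(number) else "osboot"
    print(number, handler)
```

A check of this kind can help when deciding whether a service used by application code must be provided by the osboot file in use.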

1.14.1 The Simulation Event File


Simulation is driven by modeling the 29K processor pipeline operation.
Instructions are fetched from memory, and make their way through the decode, execute
and write-back stages of the four-stage pipeline. Accurate modeling of processor
internals enables the simulator to faithfully represent the operation of real
hardware.

The simulator can also be driven from an event file. This file contains commands
which are to be performed at specified time values. All times are given in processor
cycles, with simulation starting at cycle 0. The simulator examines the event file
and performs the requested command at the indicated cycle time.

The syntax of the command file is very simple; each command is entered on a
single line preceded with an integer cycle-time value. There are about 15 to 20
different commands; most of them enable extra information to be placed in the
simulation results file, such as recording register value changes, displaying cache
memory contents, monitoring floating-point unit operation, and much more. A second
group of commands are mainly used with microcontroller 29K family members. They
enable the on-chip peripheral devices to be incorporated in the simulation. For
example, the Am29200 parallel port can receive and transmit data from files
representing off-chip hardware.

In practice, most of these commands are little used, with one exception: the SET
command (see note below). Most users of sim29 simply wish to determine how a
code sequence, representative of their application code, will perform on different
29K family members with varying memory system configurations. The SET command
is used to configure simulation parameters and define the characteristics of
system memory and bus arrangements. I will only describe the parameters used with
the MEM option to the SET command. The cycle-time value used with the commands
of interest is zero, as the memory system characteristics are established before
simulation commences. One other option to the SET command of interest is
SHARED_ID_BUS; when used, it indicates the Instruction and Data buses are
connected together. This option only makes sense with 3-bus members of the 29K
family. All the 2-bus members already share a single bus for data and instructions,
the second bus being used for address values. The syntax for the commands of interest
is shown below:

0 SET SHARED_ID_BUS
0 SET MEM access TO value

Note,  the  SET  command  is  accepted  by  both  the  older  and  newer  versions  of  the 
simulator.  However,  the  newer  version  has  an  abbreviation  to  the  SET  command 
shown  below;  the  “SET  MEM”  syntax  is  replaced  by  a  direct  command  and  there  is 
no  need  for  the  “TO”. 




0 SET MEM IWIDTH TO 32      older syntax
0 ROMWIDTH 32               newer syntax
romwidth 32                 newer syntax


Am29000  and  Am29050 

Note, when the Instruction and Data busses are tied together with 3-bus
processors, the ROM space is still decoded separately from the Instruction space.
Tying the busses together will reduce system performance, because instructions can no
longer be fetched from Instruction space, or ROM space, while the Data bus is being
used.

Considering only the most popular event file commands simplifies the presentation
of sim29 operation and encourages its use. Those wishing to know more about
event file command options should contact AMD, who readily distribute the sim29
executable software for popular platforms, along with relevant documentation.

Table 1-5 shows the allowed access and value parameters for 3-bus members of
the 29K family, that is, the Am29000 and Am29050 processors. Off-chip memory
can exist in three separately addressed spaces: Instruction, ROM, and Data. Memory
address-decode and access times (in cycles) must be entered for each address space
which will be accessed by the processor; default values are provided.


Table 1-5. 3-bus Processor Memory Modeling Parameters for sim29

Instruction  ROM        Data        Value  Default  Operation
IDECODE      RDECODE    DDECODE     0-n    0        Decode address
IACCESS      RACCESS    DRACCESS    1-n    1        First read
                        DWACCESS    1-n    1        First write
IBURST       RBURST     DBURST      T|F    false    Burst-mode supported
IBACCESS     RBACCESS   DBRACCESS   1-n    1        Burst read
                        DBWACCESS   1-n    1        Burst write

If a memory system supports burst mode, the appropriate *BURST access parameter
must be set to value TRUE. The example below sets Instruction memory accesses
to two cycles; subsequent burst mode accesses are single-cycle. The example
commands only affect Instruction memory; additional commands are required to
establish Data memory access characteristics. Many users of the simulator only
require memory modeling parameters from Table 1-5, even if DRAM is in use.

0  SET  MEM  IACCESS  TO  2 
0  SET  MEM  IBURST  TO  true 

0  SET  MEM  IBACCESS  TO  1 
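
When several memory configurations are to be compared, the event file lines can be generated rather than typed. The Python sketch below produces lines in the older "SET MEM ... TO ..." syntax shown above, all at cycle 0. The function name is hypothetical, and the parameter names passed in are simply taken from Table 1-5; no checking against the simulator's accepted parameter list is attempted.

```python
def mem_event_lines(settings, cycle=0):
    """Format memory-modeling settings as older-syntax event file lines.

    settings -- mapping of access parameter name to value; booleans are
                written as true/false, other values as plain integers
    cycle    -- cycle time at which the commands take effect
    """
    lines = []
    for name, value in settings.items():
        if isinstance(value, bool):       # test bool first: it is also an int
            text = "true" if value else "false"
        else:
            text = str(int(value))
        lines.append(f"{cycle} SET MEM {name} TO {text}")
    return lines


# Reproduces the 2-1 Instruction memory example given in the text.
for line in mem_event_lines({"IACCESS": 2, "IBURST": True, "IBACCESS": 1}):
    print(line)
```

The resulting lines could then be written to a file and passed to the simulator with the -e option.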

If DRAM memory devices are used, there are several additional access parameters
which can be used to support memory system modeling (see Table 1-6). DRAM
devices are indicated by the *PAGEMODE parameter being set. The 29K family
internally operates with a page size of 256 words, and external DRAM memory always
operates with integer multiples of this value. For this reason, there is never any
need to change the *PGSIZE parameter setting from its default value. The first read
access to DRAM memory takes *PFACCESS cycles; second and subsequent read accesses
take *PSACCESS cycles. However, if the memory system supports burst mode,
subsequent read accesses take *PBACCESS cycles rather than *PSACCESS.
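
The relationship between the page-mode parameters can be made concrete with a small calculation. The Python sketch below estimates the cycles needed for a run of reads that all fall within one DRAM page, following the description above. It is a simplified reading of the text, not the simulator's algorithm: page crossings, precharge and bus contention are ignored, and the function and argument names are chosen for illustration.

```python
def page_mode_read_cycles(n_reads, first, secondary, burst=None):
    """Estimate cycles for n_reads consecutive reads within one page.

    first     -- cycles for the first read (the *PFACCESS value)
    secondary -- cycles for later reads without burst mode (*PSACCESS)
    burst     -- cycles for later reads when burst mode is supported
                 (*PBACCESS); None means burst mode is not available
    """
    if n_reads <= 0:
        return 0
    later = secondary if burst is None else burst
    return first + (n_reads - 1) * later


# Four reads from a memory with a 4-cycle first access:
print(page_mode_read_cycles(4, first=4, secondary=2))           # no burst mode
print(page_mode_read_cycles(4, first=4, secondary=2, burst=1))  # with burst mode
```

Comparing the two results shows how much of the initial access cost burst mode can recover on sequential reads.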

If static column DRAM memories are used, then memory devices do not require
CAS signals between same-page accesses. Static column memory use is indicated by
the *STATCOL parameter. Initial page accesses suffer the additional *PRECHARGE
access penalties; subsequent accesses all have the same access latencies. Note, burst
mode access can also apply to static column DRAM memory. Table 1-7 shows
memory modeling parameters for static column memories.


Table 1-6. 3-bus Processor DRAM Modeling Parameters for sim29 (continued)

Instruction  ROM         Data         Value  Default  Operation
IPAGEMODE    RPAGEMODE   DPAGEMODE    T|F    false    Memory is paged
IPGSIZE      RPGSIZE     DPGSIZE      1-n    256      Page size in words
IPFACCESS    RPFACCESS   DPFRACCESS   1-n    1        First read in page mode
                         DPFWACCESS   1-n    1        First write in page mode
IPSACCESS    RPSACCESS   DPSRACCESS   1-n    1        Secondary read within page
                         DPSWACCESS   1-n    1        Secondary write within page
IPBACCESS    RPBACCESS   DPBRACCESS   1-n    1        Burst read within page
                         DPBWACCESS   1-n    1        Burst write within page

Table 1-7. 3-bus Processor Static Column Modeling Parameters for sim29 (continued)

Instruction  ROM          Data         Value  Default     Operation
ISTATCOL     RSTATCOL     DSTATCOL     T|F    false       Static column memory used
ISMASK       RSMASK       DSMASK              0xffffff00  Column address mask, 64-words
IPRECHARGE   RPRECHARGE   DPRECHARGE   0-n    0           Precharge on page crossing
ISACCESS     RSACCESS     DSRACCESS    1-n    1           Read access within static column
                          DSWACCESS    1-n    1           Write access within static column

Separate regions of an address space may contain more than one type of
memory device and control mechanism. To support this, memory banking is
provided for in the simulator (see Table 1-8). The [I|R|D]BANKSTART parameter is
used to specify the start address of a memory bank; a bank is a contiguous region of
memory of selectable size, within an indicated address space. Once the *BANKSTART
command has been used, all following commands relate to the current bank,
until a new bank is selected. This type of command is more frequently used with
microcontroller members of the 29K family.
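
As a rough illustration of the banking idea, the following C sketch looks up which bank, if any, contains a given address. The structure layout and function name are assumptions made for this example; only the notion of a bank as a start address plus a size in bytes comes from the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one bank: a contiguous region given by
 * a start address (*BANKSTART) and a size in bytes (*BANKSIZE). */
struct bank {
    uint32_t start;   /* start address of memory region          */
    uint32_t size;    /* size in bytes of memory region           */
    unsigned access;  /* example per-bank attribute, in cycles    */
};

/* Return the bank containing addr, or NULL when the address falls in
 * a non-banked region. */
const struct bank *find_bank(const struct bank *banks, size_t n, uint32_t addr)
{
    for (size_t i = 0; i < n; i++) {
        /* Subtract first so that start + size cannot overflow 32 bits. */
        if (addr >= banks[i].start && addr - banks[i].start < banks[i].size)
            return &banks[i];
    }
    return NULL;
}
```

An address that matches no bank would then be modeled with the parameters given for the non-banked part of the address space.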


Table 1-8. 3-bus Processor Memory Modeling Parameters for sim29 (continued)

Instruction  ROM          Data         Value  Default  Operation
IBANKSTART   RBANKSTART   DBANKSTART   0-n    -        Start address of memory region
IBANKSIZE    RBANKSIZE    DBANKSIZE    1-n    1        Size in bytes of memory region

Am29030  and  Am29035 

The parameters used with the SET command when simulating 2-bus 29K family
members are a little different from the 3-bus parameters (see Table 1-9). The
parameters shown are for the older simulator, but they are accepted by the new simulator. For
a list of alternative parameters, which are only accepted by the newer simulator, see
the following Am29040 section. There is no longer a ROM space, and although
instructions and data can be mixed in the same memory devices, separate modeling
parameters are provided for instruction and data accesses.


Table 1-9. 2-bus Processor Memory Modeling Parameters for older sim29

Instruction  Data         Value    Default  Operation
IACCESS      DRACCESS     2-n      2        First read from SRAM
             DWACCESS     2-n      2        First write to SRAM
IBURST       DBURST       T|F      true     Burst-mode supported
IBACCESS     DBRACCESS    1-n      1        Burst read within page
             DBWACCESS    1-n      1        Burst write within page
IWIDTH       DWIDTH       8,16,32  32       Memory width
IPRECHARGE   DPRECHARGE   0-n      0        Precharge on page crossing
IPACCESS     DPRACCESS    2-n      2        First access in page mode
             DPWACCESS    2-n      2        First write in page mode
IBANKSTART   DBANKSTART   0-n      -        Start address of memory region
IBANKSIZE    DBANKSIZE    1-n      1        Size in bytes of memory region
HALFSPEED    HALFSPEED    T|F      false    Memory system is 1/2 CPU speed

Consider accessing memory for instructions: IACCESS gives the access time
unless DRAM is used, in which case the access time is given by IPACCESS. The use of
DRAM is indicated by the *PRECHARGE parameter value being non-zero. First
accesses to DRAM pages suffer an additional access delay of *PRECHARGE. If burst
mode is supported, then with all memory device types the access time for instruction
memory, other than for the first access, is given by IBACCESS.

Both the current 2-bus 29K family members support Scalable Clocking,
enabling a half-speed external memory system. They also support narrow, 8-bit or 16-bit,
memory reads. The Am29035 processor also supports dynamic bus sizing. All
external memory accesses can be 16-bit or 32-bit; processor hardware takes care of
multiple memory accesses when operating on 32-bit data. As with the 3-bus 29K family
members, the simulator provides for memory banking. This enables different
memory devices to be modeled within specified address ranges.
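
The instruction access rules given above for the older 2-bus parameters can be summarized in a short C sketch. It is one reading of the prose, not the simulator's implementation: the structure, the function name and the two flags (first access of a burst, first access to a DRAM page) are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical mirror of the instruction-side parameters of Table 1-9. */
struct ifetch_params {
    unsigned iaccess;     /* IACCESS: first read from SRAM           */
    unsigned ipaccess;    /* IPACCESS: first access in page mode     */
    unsigned iprecharge;  /* IPRECHARGE: non-zero indicates DRAM     */
    unsigned ibaccess;    /* IBACCESS: burst read within page        */
    bool     iburst;      /* IBURST: burst mode supported            */
};

/* Cycles for one instruction fetch.
 * first:    this is the first access (not a continuation of a burst).
 * new_page: this is the first access to a DRAM page. */
unsigned ifetch_cycles(const struct ifetch_params *p, bool first, bool new_page)
{
    bool dram = p->iprecharge != 0;
    unsigned cycles;

    if (!first && p->iburst)
        return p->ibaccess;              /* burst continuation */

    cycles = dram ? p->ipaccess : p->iaccess;
    if (dram && new_page)
        cycles += p->iprecharge;         /* extra delay on a new DRAM page */
    return cycles;
}
```

For example, with IPACCESS set to 3 and IPRECHARGE set to 1, the first fetch from a new DRAM page would be charged 4 cycles under this reading.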

Alternative  Am29030,  Am29035  and  Am29040 

As stated in the previous section, the newer sim29 can accept the memory
modeling parameters used by the older sim29. However, the newer simulator can operate
with alternative modeling commands; these are shown in Table 1-10. Commands
can be in upper or lower case, but they are shown here in lower case. A list of
available simulator commands can be obtained by issuing the command “sim29 -29040
-help”. An example of Am29040 processor simulation can be found in section 8.1.3.


Table 1-10. 2-bus Processor Memory Modeling Parameters for newer sim29

Command        Value            Operation
rombank        <adds> <size>    Size and address of ROM/SRAM
rambank        <adds> <size>    Size and address of DRAM
halfspeedbus   true|false       Scalable Clocking (default=false)
logging        true|false       Logging to file sip.log (default=false)

ROM/SRAM     DRAM            Value  Default  Operation
romread      ramread         2-n    2        First read
romwrite     ramwrite        2-n    2        First write
romburst     ramburst        T|F    true     Enable burst mode addressing
rombread     rambread        1-n    1        Burst read within page
rombwrite    rambwrite       1-n    1        Burst write within page
rompage      rampage         T|F    true     Enable page mode
rompread     rampread        2-n    2        Single read within page
rompwrite    rampwrite       2-n    2        Single write within page
rompwidth    ramwidth        16,32  32       Bit width of memory
             ramprecharge    0-n    0        DRAM precharge time
             rampprecharge   0-n    0        Page mode DRAM precharge
             ramrefrate      0-n    0        DRAM refresh rate (0=off)

ROM and SRAM memory types are modeled with the same set of commands.
The simulator allocates a default ROM/SRAM memory bank starting at address 0x0.
Unless a RAMBANK command is used to allocate a DRAM memory section at a low
memory address, all code and data linked for low memory addresses will be allocated
to the default ROM/SRAM memory bank.

DRAM memory is modeled with the RAM* modeling commands. A default
DRAM memory section is established at address 0x4000,0000. Unless a
ROMBANK command is used to allocate a ROM/SRAM memory bank at this
address range, all accesses to high memory will be satisfied by the default DRAM
memory.

The default linker command files used with the High C 29K tool chain typically
link programs for execution according to the above default memory regions.
However, older releases of the compiler tool chain (or other tool chains) may link for
different memory models. This would require the use of RAMBANK-type commands to
establish the correct memory model. Alternatively, a compiler command file could
be used to ensure a program is linked for the default simulator memory model (see
section 2.3.6).

Am29200  and  Am29205 

The simulator does not maintain different memory access parameters for
instruction and data access when modeling microcontroller members of the 29K
family. However, it does support separate memory modeling parameters for DRAM
and ROM address regions (see Table 1-11). Each of these two memory regions has its
own memory controller supporting up to four banks. A bank is a contiguous range of
memory within the address range accessed via the region controller. The DRAM
region controller is a little more complicated than the ROM region controller. The
parameters shown in Table 1-11 are for the older simulator, but they are accepted by the
new simulator. For a list of alternative parameters, which are only accepted by the
newer simulator, see the following Am29240 section.

The DRAM access is fixed at four cycles (1 for precharge + 3 for latency); it
cannot be programmed. Subsequent accesses to the same page take four cycles unless
pagemode memories are supported. Note that the first access takes only three cycles rather
than four, as the RAS will already have met the precharge time. Basically, to precharge
the RAS bit lines, all RAS lines need to be taken high between each change of the
row address. A separate cycle is needed for precharge when back-to-back DRAM
accesses occur. Use of pagemode memories is indicated by the PAGEMODE
parameter being set; when used, the processor need not supply RAS memory strobe
signals before page CAS strobes for same-page accesses. This reduces subsequent page
access latency to three cycles. Additionally, when pagemode is used and a data burst
is attempted within a page, access latency is two cycles. The DRAM memory width
can be set to 16 or 32 bits. Of course, when an Am29205 is used, all data memory
accesses are restricted by the 16-bit width of the processor data bus.

To explain further, access times to DRAM for non-pagemode memories follow
the sequence:

    X,3,4,4,4,X,3,4,4,4,X,X,3,X,3,...

Here X is a non-DRAM access, say to ROM or PIA space. For DRAM
systems supporting pagemode the sequence would be:

    X,3,2,2,2,<boundary crossing>,4,2,2,<boundary crossing>,X,3,2,2,2
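
The two sequences can be reproduced with a small C sketch. It encodes only the fixed Am29200 latencies quoted in the text (3, 4, and with pagemode 3 or 2); the enumeration and function name are hypothetical, and treating the value 0 as the "X" entry is a convention of this example.

```c
#include <stdbool.h>

/* Kind of access presented to the model (names are illustrative). */
enum access {
    NON_DRAM,        /* "X": ROM or PIA space                      */
    DRAM_NEW_ROW,    /* DRAM access needing a new row address      */
    DRAM_SAME_PAGE,  /* single DRAM access within the open page    */
    DRAM_BURST       /* burst DRAM access within the open page     */
};

/* Cycles for one access; returns 0 for a non-DRAM access, whose cost
 * is outside this model. prev_was_dram is true for back-to-back DRAM
 * accesses. */
int dram_cycles(enum access a, bool prev_was_dram, bool pagemode)
{
    if (a == NON_DRAM)
        return 0;
    if (!prev_was_dram)
        return 3;   /* precharge time already met */
    if (!pagemode)
        return 4;   /* 1 precharge + 3 latency */
    if (a == DRAM_BURST)
        return 2;   /* burst within a page */
    if (a == DRAM_SAME_PAGE)
        return 3;   /* no RAS needed for the same page */
    return 4;       /* page boundary crossing, back-to-back */
}
```

Feeding the function a non-DRAM access followed by four back-to-back DRAM accesses gives 3,4,4,4 without pagemode and 3,2,2,2 when the later accesses are pagemode bursts, matching the sequences above.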

Memory devices located in ROM space can be modeled with a wider range of
parameter values. Both SRAM and ROM devices can be modeled in ROM space.
Using the RBANKNUM parameter, the characteristics of each bank can be selectively
described. Burst-mode addressing is only supported for instruction or data reading.
When the burst option is used (RBURST set to TRUE), read accesses, other than the
first for a new burst, take RBACCESS cycles rather than the standard RRACCESS
cycles. Memory device widths can be 8, 16 or 32 bits. If an Am29205
microcontroller is being modeled, memory accesses wider than the 16-bit bus width always
require the processor to perform multiple memory transfers to access the required
memory location.
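
The cost of a narrow data path can be estimated with simple arithmetic. The helper below is a hypothetical sketch, not part of sim29: it assumes the number of transfers is the access width divided by the path width, rounded up.

```c
/* Number of bus transfers needed to move access_bits of data over a
 * path that is path_bits wide (path_bits must be non-zero). */
unsigned bus_transfers(unsigned access_bits, unsigned path_bits)
{
    return (access_bits + path_bits - 1) / path_bits;
}
```

Under this assumption a 32-bit access over the 16-bit Am29205 bus needs two transfers, while an 8-bit or 16-bit access needs one.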


Table 1-11. Microcontroller Memory Modeling Parameters for sim29

                                                      Default
ROM/SRAM   Value    Default  DRAM        Value  (Am29200)  Operation
                                                1          Precharge on page crossing
RRACCESS   1-n      1                           3          First read
RWACCESS   2-n      2                           3          First write
RBURST     T|F      false                                  Burst address in ROM region
RBACCESS   1-n      1                           2          Burst read within page
                                                2          Burst write within page
ROMWIDTH   8,16,32  32       DRAMWIDTH   16,32  32         Width of memory
                             PAGEMODE    T|F    false      Page mode supported
RBANKNUM   0-3      -        DBANKNUM    0-3    -          Select which memory bank

Preparing sim29 for modeling an Am29200 system is not difficult. The
following commands configure the first two ROM banks to access non-burst-mode
memories which are 32 bits wide, and have a 1-cycle read access and a 2-cycle write
access.

    0 COM ROM bank 0 parameters
    0 SET MEM rbanknum to 0
    0 SET MEM rraccess to 1
    0 SET MEM rwaccess to 2
    0 COM ROM bank 1 parameters
    0 SET MEM rbanknum to 1
    0 SET MEM rraccess to 1
    0 SET MEM rwaccess to 2

The following DRAM parameters, like the ROM parameters above, are correct
for modeling an SA29200 evaluation board. The first DRAM bank is configured to
support pagemode DRAM access, giving access latencies of 4:3:2 (4 for first, 3 for
same-page subsequent, unless they are bursts, which suffer only 2-cycle latency).

    0 COM DRAM bank 0 parameters
    0 SET MEM dbanknum to 0
    0 SET MEM dpagemode to true


Alternative  Am2920x  and  Am2924x 

As stated in the previous section, the newer sim29 can accept the memory
modeling parameters used by the older sim29. However, the newer simulator can operate
with alternative modeling commands; these are shown in Table 1-12. Commands
can be in upper or lower case, but they are shown here in lower case. A list of
available simulator commands can be obtained by executing the command “sim29 -29240
-help”. An example of Am29200 microcontroller simulation can be found in section
8.1.

Table 1-12. Microcontroller Processor Memory Modeling Parameters for newer sim29

Command        Value                   Operation
rombank        <adds> <size>           Size and address of ROM/SRAM
rambank        <adds> <size>           Size and address of DRAM
halfspeedbus   true|false              Scalable Clocking (default=false)
logging        true|false              Logging to file sip.log (default=false)
parallelin     <file> [<speed>]        Parallel port input file
parallelout    <file> [<speed>]        Parallel port output file
serialin       a|b <file> [<baud>]     Serial port, a or b, input file
serialout      a|b <file> [<baud>]     Serial port, a or b, output file

                                     Default
ROM/SRAM     DRAM         Value    (Am29240)  Operation
romread                   1-n      1          First read
romwrite                  2-n      2          First write
romburst                  T|F      false      Enable burst mode addressing
rombread                  1-n      1          Burst read within page
             rampage      T|F      true       Enable page mode
rompwidth    ramwidth     8,16,32  32         Bit width of memory
             ramrefrate   0-n      255        DRAM refresh rate (0=off)

ROM and SRAM memory types are modeled with the same set of commands.
The simulator automatically allocates ROM/SRAM memory bank 0. Using the
ROMBANK parameter, the characteristics of each bank can be selectively
described. The default parameters are typically for a relatively fast memory system.

The DRAM memory access times are fixed by the processor specification.
However, there are some DRAM modeling commands enabling selection of
memory systems with or without pagemode devices. The simulator automatically allocates
DRAM memory bank 0 at address 0x4000,0000. All accesses to memory above this
address will be satisfied by the DRAM memory bank.

It is usually less of a problem linking programs for execution on a 29K
microcontroller, as the processor hardware dictates, to some extent, the allowed memory
regions. The default linker command files used with the High C 29K tool chain
typically link programs for execution according to the processor-specific memory
regions. Compiler command files are described in section 2.3.6.

1.14.2  Analyzing  the  Simulation  Log  File 

Running the architectural simulator is simple but rather slow. The inclusion of
detail about the processor pipeline results in slow simulation speeds. For this reason,
users typically select a portion of their application code for simulation. This portion
is either representative of the overall code or consists of subsections whose operation
is critical to overall system performance.

Older sim29 Log File Format

For demonstration purposes I have merely simulated the “hello world” program
running on an Am29000 processor. The C source file was compiled with the High C
29K compiler using the default compiler options; object file hello was produced by
the compile/link process. The memory model was the simulator default, single-cycle
operation. Given the selection of default memory parameters, there is no need for an
eventfile establishing memory parameters. However, I did use an eventfile with the
following contents:

    0 log on channel

This option has not previously been described; it enables the simulator to
produce an additional log file of channel activity. This can occasionally be useful when
studying memory system operation in detail. The simulator was started with the
command:

    sim29 -29000 -r /gnu/29k/src/osboot/sim/osboot -e eventfile hello

Two simulation result files were produced; the most important of which, the
default simulation output file, sim.out, we shall briefly examine. The channel.out file
reports all instruction and data memory access activity. The contents of the sim.out
file are shown below exactly as produced by the simulator:

    AMD ARCHITECTURAL SIMULATOR, V# 1.0-17PC

    ### T=3267 Am29000 Simulation of "hello" complete -- successful

    <<<<< SUMMARY STATISTICS >>>>>

    CPU Frequency = 25.00MHz
    Nops: 50
    total instructions = 2992
    User Mode:        291 cycles (0.00001164 seconds)
    Supervisor Mode:  2977 cycles (0.00011908 seconds)
    Total:            3268 cycles (0.00013072 seconds)
    Simulation speed: 22.89 MIPS (1.09 cycles per instruction)

    - Pipeline -
    8.45% idle pipeline:
        6.46% Instruction Fetch Wait
        0.46% Data Transaction Wait
        0.18% Page Boundary Crossing Fetch Wait
        0.00% Unfilled BTCache Fetch Wait
        0.49% Load/Store Multiple Executing
        0.03% Load/Load Transaction Wait
        0.83% Pipeline Latency
    Total Wait:       276 cycles (0.00001104 seconds)

    - Branch Target Cache -
    Partial hits:             0
    Branch btcache access:    2418
    Branch btcache hits:      2143
    Branch btcache hit ratio: 88.63%

    - Translation Lookaside Buffer -
    TLB access:    0
    TLB hits:      0
    TLB hit ratio: 0.00%

    - Bus Utilization -
    Inst Bus Utilization: 70.01%
        2288 Instruction Fetches
    Data Bus Utilization: 10.86%
        20 Loads
        335 Stores

    - Register File Spilling/Filling -
    0 Spills, 0 Fills

    Opcode Histogram
    ILLEGAL:       CONSTN: 6     CONSTH: 68    CONST: 121
    MTSRIM: 5      CONSTHZ:      LOADL:        LOADL:
    CLZ:           CLZ:          EXBYTE:       EXBYTE:

    System Call Count Histogram
    EXIT 1:1       GETARGS 260:1    SETVEC 289:2

    - MEMORY SUMMARY -
    Memory Parameters for Non-banked Regions
    I_SPEED: Idecode=0 Iaccess=1 Ibaccess=1

The simulator reports the total number of processor cycles simulated. Because
our example is brief, there are few User mode cycles. Most cycles are utilized by the
osboot operating system. The operating system runs in Supervisor mode and
initializes the processor to run the “hello world” program in User mode. The fast memory
system has enabled the processor pipeline to be kept busy; only an 8.45% idle pipeline is
reported. A breakdown of the activities contributing to pipeline stalling is shown.
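
The summary figures in this log follow from three totals, as the C sketch below shows. The structure and function names are hypothetical; the formulas are ordinary arithmetic that happens to reproduce the reported values, and are not taken from the simulator source.

```c
/* Totals read from the SUMMARY STATISTICS section of sim.out. */
struct sim_totals {
    unsigned long cycles;        /* Total cycles           */
    unsigned long instructions;  /* total instructions     */
    unsigned long wait_cycles;   /* Total Wait cycles      */
    double        mhz;           /* CPU Frequency in MHz   */
};

/* Millions of instructions per second at the simulated clock rate. */
double sim_mips(const struct sim_totals *t)
{
    return t->mhz * (double)t->instructions / (double)t->cycles;
}

/* Average cycles per instruction. */
double sim_cpi(const struct sim_totals *t)
{
    return (double)t->cycles / (double)t->instructions;
}

/* Percentage of cycles in which the pipeline was idle. */
double sim_idle_percent(const struct sim_totals *t)
{
    return 100.0 * (double)t->wait_cycles / (double)t->cycles;
}
```

With 3268 cycles, 2992 instructions, 276 wait cycles and a 25 MHz clock, these expressions give approximately 22.89 MIPS, 1.09 cycles per instruction and an 8.45% idle pipeline, in agreement with the log.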

Next reported is the Branch Target Cache (BTC) activity. If a processor
incorporating an Instruction Cache Memory rather than a BTC had been simulated, the
corresponding results would replace the BTC results shown. There were 2418 BTC
accesses, of which 2143 found valid entries. This gives a hit ratio of 88.63%. Partial hits
refer to the number of BTC entries which were not fully used. This occurs when one
of the early entries in the four-entry cache block contains a jump.

If the operating system had arranged for Translation Look-Aside Buffer (TLB)
use, then the next section would report its activity. In the example, the application ran with
physical addressing, which does not require TLB support. Next reported is bus
activity. The large number of processor registers results in little off-chip data memory
access, and hence low Data Bus utilization. The Instruction Bus is used to fill the
Instruction Prefetch Buffer and BTC, and shows much higher utilization. Typically,
programs are more sensitive to instruction memory performance than data memory.

The simulator then produces a histogram of instruction and system call usage.
The listing above only shows an extract of this information, as it is rather large.
Examining this data can reveal useful information, such as extensive floating-point
instruction use.

Finally reported is a summary of the memory modeling parameters used during
simulation. This information should match with the default parameters or any
parameters established by the eventfile. It is useful to have this information recorded along
with the simulation results.

Newer  sim29  Log  File  Format 

As with the previous demonstration, the “hello world” program is used here to
show the output format of the newer architectural simulator. The selected processor
is this time an Am29240 microcontroller. The C source file was compiled with the
High C 29K compiler using the -O4 compiler option; object file hello was produced
by the compile/link process. The memory model was the simulator default. Given the
selection of default memory parameters, there is no need for an eventfile to establish
memory parameters. The simulator was started with the command shown below.
Note that there is no need to use the -r option and specify an osboot file.

    sim29 -29240 hello

The simulation result file, sim.out, was produced. The contents of the sim.out
file are shown below exactly as produced by the simulator:

    Am292xx Architectural Simulator, Version# 2.4
    Command line: /usr/29k/bin/sim240 -29240 hello
    Boot file: /usr/29k/lib/osb24x
        Text section:   00000000 - 0000001f
        Text section:   00000020 - 00000333
        Text section:   00000340 - 0000035f
        Text section:   00000360 - 00006b6b
        BSS section:    40000400 - 400007df
    Application file: hello
        Text section:   40010000 - 4001332b
        Text section:   4001332c - 4001333b
        Text section:   4001333c - 4001334b
        Data section:   40014000 - 40014993
        Lit section:    40014994 - 40014c63
        BSS section:    40014c64 - 40014ca3
    Argv memory:        400150a0 - 4001589f
    Heap memory:        40015ca0 - 40035c9f
    Memory stack:       40fbf7f0 - 40fdffef
    Register stack:     40fdfff0 - 410007ef
    Vector Area:        40000000 - 400003ff

    ROM: Address      Size  Rd  Wr  Bmd  BRd  Wid
         0x0          *     1   2   0    1    32
    RAM: Address      Size  Rd  Wr  Pmd  PRd  PWr  Wid  Ref
         0x40000000   *     2   2   1    1    1    32   255
    Half speed memory = 0
    Starting simulation...

    hello world

    HIF Exit: Value = 12

    Simulation summary:
        Cycles: 7101
        Supervisor mode = 100.0%
        User mode = 0.0%
        MIPS = 18.8 (25.0 Mhz * ((5342 instructions)/(7101 cycles)))

    Pipeline:
        Average run length= 5.9 instructions between jumps taken
        Fetches not used due to jumps =299
        PipeHold: 1759 cycles = 24.8%
            Fetch waits: 1520 cycles = 21.4%
            Load waits: 133 cycles = 1.9%
            Store waits: 79 cycles = 1.1%
            Load Multiple waits: 3 cycles = 0.0%
            Store Multiple waits: 24 cycles = 0.3%

    Channel:
        Rom: accesses = 809
        Rom: average cycles per access = 1.0
        Ram: accesses = 1959
        Ram: average cycles per access = 1.7
        Ram: average cycles waiting for precharge = 0.2
        Ram: average cycles waiting for refresh = 0.2

    Instruction Cache Size = 4 Kbytes
        Hit ratio = 66.4% (3766/5673)

    Data Cache Size = 2 Kbytes
        Hit ratio = 63.6% (136/214)

The format of the log file will appear familiar to those experienced with the older
architectural simulator; the total number of processor cycles simulated is reported.
There are no User mode cycles as the default osboot (osb24x) executed the hello
program in Supervisor mode. Most cycles are utilized by the osboot operating
system. The relatively fast memory system has enabled the processor pipeline to be kept
busy; a 24.8% idle pipeline is reported. A breakdown of the activities contributing to
pipeline stalling is shown. Most pipeline stalls are due to instruction fetching; the
DRAM memory has a 2-cycle first access time, rather than the ideal 1-cycle. The
newer simulator reports the average number of instructions executed between jump
or branch instructions. The run length is shown to be 5.9 instructions, which is typical
of a 29K program.

Next  reported  is  Channel  activity.  All  load  and  store  instructions  make  use  of  the 
Channel.  Statistics  are  presented  separately  for  the  ROM/SRAM  and  DRAM 
memory  systems.  Typically,  performance  is  much  more  sensitive  to  instruction 
memory  access  rather  than  accesses  to  data.  This  is  particularly  true  with  the  29K 
family  due  to  its  large  number  of  on-chip  registers. 

Next reported is on-chip cache activity. There were 5673 accesses to the
instruction cache, of which 66.4% found valid entries. The Am29240 has the benefit
of a data cache and the results are shown. The hello program is small and only 214
data cache accesses were made, of which 63.6% hit in the cache.
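
The percentages in the newer log can be checked with a one-line helper. The function below is a hypothetical sketch; returning zero for an empty total is a choice made here so that a unit with no accesses reports 0%, as the TLB section of the older log does.

```c
/* part as a percentage of whole; 0.0 when whole is zero. */
double percent(unsigned long part, unsigned long whole)
{
    if (whole == 0)
        return 0.0;
    return 100.0 * (double)part / (double)whole;
}
```

Applied to the counts in the log, percent(3766, 5673) is about 66.4 and percent(136, 214) is about 63.6. The individual wait counts (1520 + 133 + 79 + 3 + 24) sum to the 1759 PipeHold cycles, and percent(1759, 7101) is about 24.8.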

Reported in the sim.out file before simulation started are the memory modeling
parameters used during simulation. This information should match with the default
parameters or any parameters established by the eventfile. It is useful to have this
information recorded along with the simulation results. The values reported are shown
again below:

    ROM: Address      Size  Rd  Wr  Bmd  BRd  Wid
         0x0          *     1   2   0    1    32
    RAM: Address      Size  Rd  Wr  Pmd  PRd  PWr  Wid  Ref
         0x40000000   *     2   2   1    1    1    32   255
    Half speed memory = 0

The ROM section refers to both ROM and SRAM memory. The tokens used are
a little cryptic. For example, “Rd” refers to memory read cycles, and “BRd” refers to
burst mode read times. The option to use Scalable Clocking was not selected; “Half
speed memory” is set to false.


Applications Programming


Application programming refers to the process of developing task-specific
software. Typical 29K tasks are controlling a real-time process, processing
communications data, processing real-time digital signals, and manipulating video images. There
are many more types of applications, such as word processing, which the 29K is suited
for, but the 29K is better known in the embedded engineering community, which
typically deals with real-time processing.

This chapter deals with aspects of application programming which the Software
Engineer is required to know. Generally, computer professionals spend more time
developing application code, compared to other software development projects such
as operating systems. Additionally, applications are increasingly developed in a
high-level language. Since C is the dominant language for this task, I shall present code
examples in terms of C. Assembly-level programming is dealt with in a separate
chapter.

The first part of this chapter deals with the mechanism by which one C
procedure calls another, and how they agree to communicate data and make use of
processor resources [Mann et al. 1991b]. This is termed the Calling Convention. It is
possible that different tool developers could construct their own calling mechanism, but
this may lead to incompatibilities in mixing routines compiled by different vendor
tools. AMD avoided this problem by devising a calling convention which was
adopted by all tool developers. Detailed knowledge of, say, individual register
support tasks for the calling convention is not presented, except for the register and
memory stacks, which play an important role in the 29K calling mechanism. In
practice, C language developers typically do not need to be concerned about individual
register assignments, as these are taken care of by the compiler [Mann 1991c]. Chapter 3
expands on register assignment, and it is of concern here only in terms of
understanding the calling convention concepts and run-time efficiencies.

Operating system support services (HIF services) are then dealt with. The
transition from operating system to the application main() routine is described.
Operating system services along with other support routines are normally accessed through
code libraries. These libraries are described for the predominant tool chains. Using
the available libraries and HIF services, it is an easy task to arrange for interrupts to be
processed by C language handler routines; the mechanism is described. Finally,
utility programs for operations such as PROM preparation are listed and their
capabilities presented.

2.1  C  LANGUAGE  PROGRAMMING 

Making a subroutine call on a processor with general-purpose registers is
expensive in terms of time and resources. Because functions must compete for register
use, registers must be saved and restored through register-to-memory and
memory-to-register operations. For example, a C function call on the MC68020 processor
[Motorola 1985] might use the statements:

    char  bits8;
    short bits16;

    printf("char=%c short=%d", bits8, bits16);

After they are compiled, the above statements would generate the
assembly-level code shown below:

            .ascii  "char=%c short=%d\0"
            MOVE.W  -4[A6],D0       ; copy bits16 variable
            EXT.L   D0              ; to register
            MOVE.L  D0,-[A7]        ; now push on stack
            MOVE.B  -1[A6],D0       ; copy bits8 variable
            EXTB.L  D0              ; to register
            MOVE.L  D0,-[A7]        ; now push on stack
            PEA     L15             ; stack text string pointer
            JSR     printf
            LEA     12[A7],A7       ; repair stack pointer

The assembly listing above shows how parameters pass via the memory stack to
the function being called. The LINK instruction copies the stack pointer A7 to the
local frame pointer A6 upon entry to a routine. Within the printf() routine, the
parameters passed and local variables in memory are referenced relative to register A6.
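
One detail of the listing above follows directly from the C language rather than from the processor. When char and short values are passed to a variadic function such as printf(), the default argument promotions convert them to int, which is why the MC68020 code sign-extends each value (EXT.L, EXTB.L) and pushes a full long word. The sketch below demonstrates the promotion on any host compiler; the function names are hypothetical, and the unary plus is used only because it applies the same integer promotion.

```c
#include <stddef.h>

/* Size of a char operand after integer promotion. */
size_t promoted_char_size(char c)
{
    return sizeof(+c);   /* +c has type int; the operand is not evaluated */
}

/* Size of a short operand after integer promotion. */
size_t promoted_short_size(short s)
{
    return sizeof(+s);   /* +s has type int */
}
```

Both functions return sizeof(int). On a target where int and pointers are 4 bytes, the two promoted values plus the string pointer occupy 12 bytes of stack, matching the LEA 12[A7],A7 instruction that repairs the stack pointer.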

To reduce future access delays, the printf() routine will normally copy data to
general-purpose registers before using them. For instance, using a
memory-to-memory operation when moving data from the local frame of the function call stack
would reduce the number of instructions executed. However, these are CISC
instructions that require several machine cycles before completion.

In the example, the C function call passes two variables, bits8 and bits16, to the
library function printf(). The following assembly code shows part of the printf()
function for the MC68020.

printf:
        LINK    A6,#-32         ; local variable space
        LEA     8[A6],A0        ; unstack string pointer
                                ; (body of function not shown)
        UNLK    A6
        RTS                     ; return

Several multi-cycle instructions (like LINK and UNLK) are required to pass
the parameters and establish the function context. Unlike the variable instruction
format in the MC68020, the 29K processor family has a fixed 32-bit instruction format
(see section 1.11). The same C statements compiled for the Am29000 processor
generate the following assembly code for passing the parameters and establishing the
function context:

L1:     .ascii  "char=%c short=%d\0"
        const   lr2,L1
        consth  lr2,L1
        add     lr3,lr6,0       ; move bits8 and bits16
        add     lr4,lr8,0       ; to bottom of the
                                ; activation record
        call    lr0,printf      ; return address in lr0

The number of instructions required is certainly smaller, and they are all simple
single-cycle RISC instructions. However, to understand just how parameters
are passed during a function call, the procedure activation records and
their use of the local register file must first be explained.


2.1.1  Register  Stack 

A register stack is assigned an area of memory used to pass parameters and
allocate working registers to each procedure. The register cache replaces the top of the
register stack, as shown in Figure 2-1. All 29K processors have a 128-word local
register file; these registers are used to implement the cache for the top of the register
stack. Note that, if desired, only a portion of the 128-word register file need be
allocated to register cache use (see section 2.3.2).

The global registers rab (gr126) and rfb (gr127) point to the top and the bottom
of the register cache. Global register rsp (also known as gr1) points to the top of the
register stack. The register cache, or stack window, moves up and down the register
stack as the stack grows and shrinks. Use of the register cache, rather than the
memory portion of the register stack, allows data to be accessed through local
registers at high speed. On-chip triple-porting of the register file (two read ports and one
write port for most 29K family members) enables the register stack to perform better
than a data memory cache, which cannot support read and write operations in the
same cycle.

Figure 2-1. Cache Window


2.1.2  Activation  Records 

A 29K processor does not apply push or pop instructions to external memory
when passing procedure parameters. Instead each function is allocated an activation
record in the register cache at compile time. Activation records hold any local
variables and parameters passed to functions.

The caller stores its outgoing arguments at the bottom of the activation record.
The called function establishes a new activation record below the caller's record.
The top of the new record overlaps the bottom of the old record, so that the
outgoing parameters of the calling function are visible within the called function's
activation record.

Although the activation record can be any size within the limits of the physical
cache, the compiler will not allocate more than 16 registers to the parameter-passing
part of the activation record. Functions that cannot pass all of their outgoing
parameters in registers must use a memory stack for additional parameters; global register
msp (gr125) points to the top of the memory stack. This happens infrequently, but is
required for parameters that have their address taken (for example in C, &variable).
Data parameters at known addresses cannot be supported in register address space
because data addresses always refer to memory, not to registers.
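
As an illustration of the address-taken case, the C fragment below applies the & operator
to a parameter. The function names bump and add_one are hypothetical and do not come
from any 29K library. Because a pointer must refer to memory, a compiler following the
convention described above would be expected to give value a home on the memory stack
rather than keep it only in a local register; exactly how this is arranged is compiler
specific.

```c
/* Hypothetical example: a parameter whose address is taken. */

/* Increment the integer that p points to. */
void bump(int *p)
{
    *p += 1;
}

/* Taking &value requires 'value' to have a memory address, so it
 * cannot live only in the register cache. */
int add_one(int value)
{
    bump(&value);
    return value;
}
```

A call such as add_one(41) returns 42 regardless of where the compiler chooses to place
the parameter; only the generated code differs.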

The  following  code  shows  part  of  the  printf()  function  for  the  29K  family: 

printf:
        sub     gr1,gr1,16          ; function prologue
        asgeu   V_SPILL,gr1,rab     ; compare with top of window
        add     lr1,gr1,36          ; rab is gr126
                                    ; (body of function not shown)
        jmpi    lr0                 ; return
        asleu   V_FILL,lr1,rfb      ; compare with bottom
                                    ; of window gr127

The register stack pointer, rsp, points to the bottom of the current activation
record. All local registers are referenced relative to rsp. Four new registers are required
to support the function call shown, so rsp is decremented 16 bytes. Register rsp
performs a role similar to the MC68000's A7 and A6 registers, except that it points to data
in high-speed registers, not data in external memory.

The compiler reserves local registers lr0 and lr1 for special duties within each
activation record. Register lr0 contains the address at which execution resumes on
return to the caller's activation record. Register lr1 points to the top of the caller's
activation record. The new frame allocates local registers lr2 and lr3 to hold printf()
function local variables.

As Figure 2-2 shows, the positions of five registers overlap. The three printf()
parameters enter from lr2, lr3 and lr4 of the caller's activation record and appear as
lr6, lr7 and lr8 of the printf() function activation record.
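
The renumbering follows directly from local registers being addressed relative to gr1.
The short C sketch below works through the arithmetic for the printf() example; the
function names are hypothetical and the code only models the numbering, not the
hardware.

```c
/* Sketch: local registers are addressed relative to gr1 (rsp), so
 * lowering rsp by n words renumbers each caller register upward by n. */

/* Convert the byte count used in a prologue such as "sub gr1,gr1,16"
 * to a word count (each register is 32 bits, that is 4 bytes, wide). */
unsigned bytes_to_words(unsigned bytes)
{
    return bytes / 4u;
}

/* Return the callee's local register number for a given caller
 * register, given how many words the callee lowered rsp by. */
unsigned callee_lr_number(unsigned caller_lr, unsigned words_lowered)
{
    return caller_lr + words_lowered;
}
```

With the 16-byte adjustment made by printf(), bytes_to_words(16) gives 4, so the
caller's lr2, lr3 and lr4 become lr6, lr7 and lr8 in the callee, matching Figure 2-2.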


2.1.3  Spilling  And  Filling 

If not enough registers are available in the cache when it moves down the register
stack, then a V_SPILL trap is taken, and the registers spill out of the cache into
memory. Only procedure calls that require more registers than are currently available
in the cache suffer this overhead.

Once a spill occurs, a fill (V_FILL trap) can be expected at a later time. The fill
does not happen when the function call causing the spill returns, but rather when
some earlier function that requires data held in a previous activation record (just
below the cache window) returns. Just before a function returns, the lr1 register, which
points to the top of the caller's activation record, is compared with the pointer to the
bottom of the cache window (rfb). If the activation record is not stored completely in
the cache, then a fill overhead occurs.

                     higher addresses

   caller's     printf()'s
   register     register     contents in printf()
   --------     ----------   --------------------
     lr4           lr8       in-coming param     <- top of activation record
     lr3           lr7       in-coming param
     lr2           lr6       in-coming param
     lr1           lr5       frame pointer
     lr0           lr4       return address      <- base of caller's activation record
                                                    (gr1 before printf() is called)
                   lr3       local
                   lr2       local
                   lr1       frame pointer
                   lr0                           <- base of printf() activation record
                                                    (gr1 (rsp) when printf() executes)

   The printf() activation record is 9 words. Register gr1 is
   lowered 4 words (16 bytes) in the prologue of printf().

Figure 2-2. Overlapping Activation Record Registers
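
The two checks made by the prologue and epilogue can be modelled in C as simple
unsigned comparisons. This is only a sketch of the conditions expressed by the asgeu and
asleu instructions in the printf() listing; the function names are hypothetical, and the
real trap handlers do considerably more work than is shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Prologue check, modelling "asgeu V_SPILL,gr1,rab": the assertion
 * gr1 >= rab (unsigned) must hold, otherwise the spill trap is taken. */
bool prologue_needs_spill(uint32_t new_gr1, uint32_t rab)
{
    return new_gr1 < rab;
}

/* Epilogue check, modelling "asleu V_FILL,lr1,rfb": the assertion
 * lr1 <= rfb (unsigned) must hold, otherwise the fill trap is taken. */
bool epilogue_needs_fill(uint32_t lr1, uint32_t rfb)
{
    return lr1 > rfb;
}
```

For example, with rab at 0x1000 and rfb at 0x1200, lowering gr1 to 0x0FF0 would require
a spill, while a caller whose activation record tops out at 0x1210 would require a fill
before the return completes.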

The  register  stack  improves  the  performance  of  call  operations  because  most 
calls  and  returns  proceed  without  any  memory  access.  The  register  cache  contains 
128  registers,  so  very  few  function  calls  or  returns  require  register  spilling  or  filling. 

Because  most  of  the  data  required  by  a  function  resides  in  local  registers,  there  is 
no  need  for  elaborate  memory  addressing  modes,  which  increase  access  latency.  The 
function-call  overhead  in  the  29K  family  consists  of  a  small  number  of  single-cycle 
instructions;  the  overhead  in  the  MC68020  requires  a  greater  number  of  multi-cycle 
instructions. 

2.1.4  Global  Registers 

In the discussion of activation records (section 2.1.2), it was stated that
functions can use activation space (local registers) to hold procedure variables. This is
true, but procedures can also use processor global registers to hold variables. Each
29K processor has a group of registers (global registers) which are located in the
register file, but are not part of the register cache. Global registers gr96-gr127 are used
by application programs. When developing software in C, there is no need to know
just how the compiler makes use of these global registers; the Assembly Level
Programming chapter, section 3.3, discusses register allocation in detail.

Data held in global registers, unlike procedure activation records, do not survive
procedure calls. The compiler has 25 global registers available for holding temporary
variables. These registers perform a role very similar to the eight data and eight
address general-purpose registers of the MC68020. The first 16 of the global registers,
gr96-gr111, are used for procedure return value passing. Return objects larger than
16 words must use the memory stack to return data (see section 3.3).

An extension to some C compilers has been made (the High C® 29K™ compiler for
one), enabling a calling procedure to assume that some global registers will survive a
procedure call. If the called function is defined before calls are made to it, the
compiler can determine its register usage. This enables the global register usage of the
calling function to be restricted to available registers, and the calling function need only
save in local registers those global registers it knows are used by the callee.

2.1.5  Memory  Stack 

Because the register cache is limited in size, a separate memory stack is used to
hold large local variables (structs or arrays), as well as any incoming parameters
beyond the 16th parameter. (Note that small structs can still be passed in local registers as
procedure parameters.) Register msp is the memory stack pointer. (Note that having two
stacks generally requires several operating system support mechanisms not required
by a single-stack CISC-based system.)
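
The division of parameters between the two stacks can be sketched in C as follows. The
16-register limit is taken from the text; the type and function names are hypothetical,
and the sketch deliberately ignores the special cases mentioned above (address-taken
parameters and large structs), which a real compiler must also handle.

```c
/* Sketch only: split a count of outgoing parameter words between the
 * register cache and the memory stack. */
#define MAX_PARAM_REGS 16u  /* parameter-passing register limit from the text */

typedef struct {
    unsigned in_registers;     /* words passed in local registers */
    unsigned on_memory_stack;  /* words passed via the memory stack (msp) */
} param_split;

param_split split_param_words(unsigned param_words)
{
    param_split s;
    s.in_registers = (param_words < MAX_PARAM_REGS) ? param_words
                                                    : MAX_PARAM_REGS;
    s.on_memory_stack = param_words - s.in_registers;
    return s;
}
```

A call passing three words, as in the printf() example, uses registers only; a call
passing 20 words would place 16 in registers and the remaining 4 on the memory stack.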

2.2  RUN-TIME  HIF  ENVIRONMENT 

Application programs need to interact with peripheral devices which support
communication and other control functions. Traditionally, embedded program
developers have not been well served by the tools available to tackle the related software
development. For example, performing the popular C library service printf(), using
a peripheral UART device, may involve developing the printf() library code and
the underlying operating system code which controls the communications UART.
One solution to the problem is to purchase a real-time operating system. Such systems are
normally supplied with libraries which support printf() and other popular library
services. In addition, operating systems contain code to perform task context switching
and interrupt handling.

Typically, operating system vendors have their own operating system interface
specification. This means that library code, like printf(), which ultimately makes
operating system service requests, is not easily ported between different operating
systems. In addition, compiler vendors, who typically develop library code for the
target processor for sale along with the compiler, cannot be assured of a standard
interface to the available operating system services.

AMD wished to relieve this problem and allow library code to be used on any
target 29K platform. In addition, AMD wished to ensure a number of services would
be available. These operating system services were considered necessary to enable
performance benchmarking of application code (for example, the cycles service
returns a 56-bit elapsed processor cycle count). The result was the Host Interface
specification, known as HIF. It specifies a number of operating system services which
must always be present. The list is very small, but it enables library producers to be
assured that their code will run on any 29K platform. The HIF specification states
how a system call will be made, how parameters will be passed to the operating
system, and how results will be returned. Operating system vendors are not obliged to
support HIF-conforming services; they can continue to use their own operating
system interface and related library routines. But to make use of the popular
library routines from the Metaware High C 29K compiler company, the operating
system company must virtualize the HIF services on top of the underlying operating
system services.
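
As a small worked example of using a HIF result, the 56-bit count returned by the cycles
service arrives in two 32-bit registers (Table 2-2 lists gr96 as the least significant
bits and gr97 as the most significant bits). The C sketch below combines them into one
64-bit value. The function name is hypothetical, and masking the upper word to 24 bits
is an assumption made here so that the result never exceeds 56 bits.

```c
#include <stdint.h>

/* Combine the two words returned by the HIF cycles service into one
 * 64-bit value. Assumption: only the low 24 bits of the upper word
 * carry count data, so any higher bits are masked off. */
uint64_t hif_cycles_to_u64(uint32_t lsbs_gr96, uint32_t msbs_gr97)
{
    uint64_t high = (uint64_t)(msbs_gr97 & 0x00FFFFFFu);
    return (high << 32) | (uint64_t)lsbs_gr96;
}
```

A benchmark would typically take two such readings and subtract them to obtain the
elapsed cycle count for the code under test.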

The original specification grew into what is now known as HIF 2.0. The
specification includes services for signal handling (see following sections on C language
interrupt handlers), memory management support, run-time environment
initialization and other processor configuration options. Much of this development was a
result of AMD developing a small collection of routines known as OS-boot (see
section 7.4). This code can take control of the processor from RESET, prepare the
run-time environment for a HIF-conforming application program, and support any HIF
request made by the application. OS-boot effectively implements a single
application-task operating system. It is adequate for many user requirements, which may be
merely to benchmark 29K applications. With small additions and changes it is
adequate for many embedded products. However, some of the HIF 2.0 services,
requested by the community who saw OS-boot as an adequate operating system, were
of such a nature that they often cannot be implemented in an operating system
vendor's product. For example, the settrap service enables an entry to be placed directly
into the processor's interrupt vector table; some operating systems, for example
UNIX, will not permit this to occur as it is a security risk and, if improperly used, an
effective way to crash the system.

There are standard memory, register and other initializations that must be
performed by a HIF-conforming operating system before entry into a user program. In C
language programs, this is usually performed by the module crt0.s. This module
receives control when an application program is invoked, and executes prior to
invocation of the user's main() function. Other high-level languages have similar modules.
The following three sections describe: what a HIF-conforming operating system
must perform before code in crt0.s starts executing; what is typically achieved in
crt0.s code; and finally, what run-time services are specified in HIF 2.0.

2.2.1  OS  Preparations  before  Calling  start  in  crt0


According to the HIF specification, operating system initialization procedures
must establish appropriate values for the general registers mentioned below before
execution of a user's application code commences. Linked application code
normally commences at address label start in module crt0.s. This module is automatically
linked with application code modules and libraries when the compiler is used to
produce the final application executable. In addition, file descriptors for the standard
input and output devices must be opened, and any Am29027 floating-point
coprocessor support as well as other trapware support must be initialized.

Register Stack Pointer (gr1)

Register gr1 points to the top of the register stack. It contains the main memory
address in which the local register lr0 will be saved, should it be spilled, and from
which it will be restored. The processor can also use the gr1 register as the base in
base-plus-offset addressing of the local register file. The content of rsp is compared
to the content of rab to determine when it is necessary to spill part of the local register
stack to memory. On startup, the values in rab, rsp, and rfb should be initialized to
prevent a spill trap from occurring on entry to the crt0 code, as shown by the
following relations:

    ((64*4) + rab) < rsp < rfb
    rfb = rab + 512

This provides the crt0 code with at least 64 registers on entry, which should be a
sufficient number to accomplish its purpose. Note that rab and rfb are normally set to be a
window distance apart, 128 words (512 bytes), but this is not the only valid setting;
see sections 2.3.2 and 4.3.1.
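
The two relations can be checked with a few lines of C, which may be useful when
bringing up a new operating system port. This is a sketch under stated assumptions: the
function and macro names are hypothetical, and the check hard-codes the normal 512-byte
window, which the text notes is not the only valid setting.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW_BYTES   512u  /* normal rab-to-rfb distance (128 words) */
#define MIN_ENTRY_REGS 64u   /* registers crt0 should find free on entry */

/* Return true when rab, rsp and rfb satisfy
 *   ((64*4) + rab) < rsp < rfb   and   rfb = rab + 512. */
bool register_stack_init_ok(uint32_t rab, uint32_t rsp, uint32_t rfb)
{
    bool window_ok = (rfb == rab + WINDOW_BYTES);
    bool room_ok   = (rab + MIN_ENTRY_REGS * 4u) < rsp && rsp < rfb;
    return window_ok && room_ok;
}
```

For instance, rab = 0x80000, rfb = 0x80200 and rsp = 0x801F0 pass the check, whereas
moving rsp down to 0x80080 leaves fewer than 64 registers between rab and rsp and fails.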

Register Free Bound (gr127)

The register stack free-bound pointer, rfb, contains the register stack address of
the lowest-addressed word not contained within the register file. Register rfb is
referenced in the epilogue of most user program functions to determine whether a register
fill operation is necessary to restore previously spilled registers needed by the
function's caller. The rfb register should be initialized to point to the highest address of the
memory region allocated for register stack use. It is recommended that this memory
region not be less than 6k bytes.

Register Allocate Bound (gr126)

The register stack allocate-bound pointer, rab, contains the register stack
address of the lowest-addressed word contained within the register file. Register rab is
referenced in the prologue of most user program functions to determine whether a
register spill operation is necessary to accommodate the local register requirements of the
called function. Register rab is normally initialized to be a window distance (512
bytes) below the rfb register value.

Memory Stack Pointer (gr125)

The memory stack pointer (msp) register points to the top of the memory stack,
which is the lowest-addressed entry on the memory stack. Register msp should be
initialized to point to the highest address in the memory region allocated for memory
stack use. It is recommended that this region not be less than 2k bytes.

Am29027  Floating-Point  Coprocessor  Support 

The Am29027 floating-point coprocessor has a mode register which has a
cumbersome access procedure. To avoid accessing the mode register, a shadow copy
is kept by the operating system and accessed in preference when a mode register read
is required. The operating system shadow mode value is not accessible to User mode
code; therefore an application must maintain its own shadow mode register value.
The floating-point library code, which maintains and accesses the shadow mode
value, is passed the mode setting, initialized by the operating system, when crt0 code
commences. Before entering crt0, the Am29027 mode register value is copied into
global registers gr96 and gr97. Register gr96 contains the most significant half of the
mode register value, and gr97 contains the least significant half.
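
The split between gr96 and gr97 can be pictured with the following C sketch. It assumes
the mode value is handled as a single 64-bit quantity whose halves are 32 bits each; the
structure and function names are hypothetical and are not part of any HIF library.

```c
#include <stdint.h>

/* The two register values handed to crt0 (names mirror the text). */
typedef struct {
    uint32_t gr96;  /* most significant half of the mode value */
    uint32_t gr97;  /* least significant half of the mode value */
} mode_halves;

/* Split a mode value, assumed here to be 64 bits wide, into halves. */
mode_halves split_mode(uint64_t mode)
{
    mode_halves h;
    h.gr96 = (uint32_t)(mode >> 32);
    h.gr97 = (uint32_t)(mode & 0xFFFFFFFFu);
    return h;
}

/* Rebuild the shadow value from the two halves, as an application-side
 * shadow copy might do when it stores the value in memory. */
uint64_t join_mode(uint32_t gr96, uint32_t gr97)
{
    return ((uint64_t)gr96 << 32) | (uint64_t)gr97;
}
```

Splitting and then rejoining a value returns the original, which is the property the
shadow copy relies on.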

Open  File  Descriptors 

File descriptor 0 (corresponding to the standard input device) must be opened
for text mode input. File descriptors 1 and 2 (corresponding to standard output and
standard error devices) must be opened for text mode output prior to entry to the
user's program. File descriptors 0, 1, and 2 are expected to be in COOKED mode (see
Appendix A, ioctl() service), and file descriptor 0 should also select ECHO mode, so
that input from the standard input device (stdin) is echoed to the standard output
device (stdout).

Software  Emulation  and  Trapware  Support 

A 29K processor may take a trap in support of the procedure call prologue and
epilogue mechanism. A HIF-conforming operating system supports the associated
SPILL and FILL traps by normally maintaining two global registers (in the
gr64-gr95 range) which contain the address of the user's spill and fill code. Keeping
these addresses available in registers reduces the delay in reaching the typically User
mode support code. A HIF-conforming operating system also installs the SPILL and
FILL trap handler code which bounces execution to the maintained handler
addresses.

Table 2-1. Trap Handler Vectors

    Trap   Description
    ----   -----------
     32    MULTIPLY
     33    DIVIDE
     34    MULTIPLU
     35    DIVIDU
     36    CONVERT
     42    FEQ
     43    DEQ
     44    FGT
     45    DGT
     46    FGE
     47    DGE
     48    FADD
     49    DADD
     50    FSUB
     51    DSUB
     52    FMUL
     53    DMUL
     54    FDIV
     55    DDIV
     64    V_SPILL (Set up by the user's task through a setvec call)
     65    V_FILL (Set up by the user's task through a setvec call)
     69    HIF System Call

Note: The V_SPILL (64) and V_FILL (65) traps are returned to the user's code to perform the trap
handling functions. Application code normally runs in User mode.


Additionally,  the  trapware  code  enabling  HIF  operating  system  calls  is 
installed.  Also,  all  HIF  conforming  operating  systems  provide  unaligned  memory 
access  trap  handlers. 

A number of 29K processors do not directly support floating-point instructions
in hardware (see section 3.1.7). However, the HIF environment requires that all
Am29000 User mode accessible instructions be implemented across the entire 29K
family. This means that unless an Am29050 processor is being used, trapware must
be installed to emulate in software the floating-point instructions not directly
supported by the hardware. Table 2-1 lists the traps which a HIF-conforming operating
system must establish support for before calling crt0 code.

When a 29K processor is supported by an Am29027 floating-point coprocessor,
the operating system may choose to use the coprocessor to support floating-point
instruction emulation. For example, the trapware routine used for emulating the
MULTIPLY instruction is known as Emultiply; however, if the coprocessor is
available the E7multiply routine is used.



2.2.2  crt0  Preparations  before  Calling  main()

Application code normally begins execution at address start in the crt0.s
module. The previous section described the environment prepared by a HIF-conforming
operating system before the code in crt0.s is executed. The crt0.s code makes final
preparations before the application main() procedure is called.

The code in crt0.s first copies the Am29027 shadow mode register value, passed
in gr96 and gr97, to memory location __29027Mode. If a system does not have an
Am29027 floating-point coprocessor then there is no useful data passed in these
registers. However, application code linked with floating-point libraries which make
use of the Am29027 coprocessor will access the shadow memory location to
determine the coprocessor operating mode value.

The setvec system call is then used to supply the operating system with the
addresses of the user's SPILL and FILL handler code, which is located in crt0.s.
Because this code normally runs in User mode address space, and the user has the option
to tailor the operation of this code, an operating system cannot know in advance
(pre-crt0.s) the required SPILL and FILL handler code operation.

When procedure main() is called, it is passed two parameters: the first, argc,
indicates the number of elements in argv; the second, argv, is a pointer
to an array of character strings:

    main(argc, argv)
    int argc;
    char *argv[];

The getargs HIF service is used to get the address of the argv array. In many
real-time applications there are no parameters passed to main(). However, to support
porting of benchmark application programs, many systems arrange for main()
parameters to be loaded into a user's data space. The crt0.s code walks through the
array looking for the terminating NULL entry; in so doing, it determines the argc
value. The register stack pointer was lowered by the start() procedure's prologue code
to create a procedure activation record for passing parameters to main().
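
The walk through the argv array can be expressed in C as shown below. The real crt0.s
performs this step in assembly language; the function name count_args is hypothetical
and the code is only a sketch of the counting logic.

```c
#include <stddef.h>

/* Count the entries of a NULL-terminated argv array, as crt0 must do
 * to derive argc from the address returned by the getargs service. */
int count_args(char *const argv[])
{
    int argc = 0;

    if (argv == NULL) {
        return 0;  /* no argument array was supplied at all */
    }
    while (argv[argc] != NULL) {
        argc++;
    }
    return argc;
}
```

For an array holding the two strings "prog" and "-x" followed by a NULL pointer, the
function returns 2; for an array containing only the NULL terminator it returns 0.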

To aid run-time libraries, a memory variable, __LibInit, is defined in
uninitialized data memory space (BSS) by the library code. If any library code needs
initialization before use, then the __LibInit variable will be assigned to point to a library
routine which will perform all necessary initialization. This is accomplished by the
linker matching up the BSS __LibInit variable with an initialized __LibInit variable
defined in the library code. The crt0.s code checks to see if the __LibInit variable
contains a non-zero address; if so, the procedure is called.
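
In C terms, the check amounts to calling through a function pointer only when it is
non-null. The sketch below uses hypothetical names (lib_init_fn, run_lib_init,
sample_lib_init and lib_init_calls) in place of the real library symbol, and the sample
hook simply counts how often it is invoked.

```c
#include <stddef.h>

typedef void (*lib_init_fn)(void);

int lib_init_calls = 0;  /* counts invocations of the sample hook */

/* Stand-in for a library initialization routine. */
void sample_lib_init(void)
{
    lib_init_calls++;
}

/* Call the initialization hook if one has been linked in.
 * Returns 1 when the hook was called and 0 when it was absent. */
int run_lib_init(lib_init_fn hook)
{
    if (hook != NULL) {
        hook();
        return 1;
    }
    return 0;
}
```

A program linked without any library needing initialization leaves the pointer zero, so
run_lib_init(NULL) does nothing; otherwise the hook runs once before main() is entered.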

The application main() procedure is ready to be called by start(). It is not
expected that main() will return. Real-time programs typically never exit. However,
benchmark programs do, and this is accomplished by calling the HIF exit service. If a
main() routine does not explicitly call exit, then it will return to start(), which
calls exit on its behalf.

2.2.3  Run-Time  HIF  Services

Table 2-2 lists the HIF system call services, calling parameters, and the returned
values. If a column entry is blank, it means the register is not used or is undefined.
Table 2-3 describes the parameters given in Table 2-2. Before invoking a HIF
service, the service number and any input parameters passed to the operating system are
loaded into assigned registers. Each HIF service is identified by its associated
service number, which is placed in global register gr121. Parameters are passed, as
with procedure calls, in local registers starting with lr2. Application programs do not
need to issue ASSERT instructions directly when making service calls. They
normally use a library of assembly code glue routines. The write service glue routine is
shown below:

__write:                            ; HIF assembly glue routine for write service
        const   gr121,20            ; tav is gr121
        asneq   69,gr1,gr1          ; system call trap
        jmpti   gr121,lr0           ; return if successful
        const   gr122,_errno        ; pass error number
        consth  gr122,_errno
        store   0,0,gr121,gr122     ; store error number
        jmpi    lr0                 ; return if failure
        constn  gr96,-1


Application programs need merely call the _write() leaf routine to issue the
service request. The system call convention states that return values are placed in global
registers starting with gr96; this makes the transfer of return data by the assembly
glue routine very simple and efficient. If a service fails, due to, say, bad input
parameters, global register gr121 is returned with an error number supplied by the operating
system. If the service was successful, gr121 is set to Boolean TRUE (0x80000000).
The glue routines check the gr121 value and, if it is not TRUE, copy the value to
memory location errno. This location, unlike gr121, is directly accessible by a C
language application which requested the service.
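
The decision made by the glue routine can be modelled in C as follows. This is a sketch
of the logic only, not the library source: hif_errno stands in for errno,
hif_glue_result is a hypothetical name, and the success test examines the most
significant bit because that is the bit the jmpti instruction tests when it treats
gr121 as a Boolean.

```c
#include <stdint.h>

#define HIF_TRUE 0x80000000u  /* Boolean TRUE as returned in gr121 */

int hif_errno = 0;            /* stand-in for the C library's errno */

/* Model of the glue routine's return path: on success pass back the
 * value from gr96; on failure record gr121 as the error number and
 * return -1, as the "constn gr96,-1" delay-slot instruction does. */
int32_t hif_glue_result(uint32_t gr121, int32_t gr96)
{
    if ((gr121 & HIF_TRUE) != 0u) {
        return gr96;
    }
    hif_errno = (int)gr121;
    return -1;
}
```

A successful write of 12 bytes therefore returns 12 and leaves the error variable
untouched, while a failure reported with error number 9 returns -1 and stores 9.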

Run-time HIF services are divided into two groups, separated by their
service number. Numbers 255 and less require the support of complex operating
system services such as file system management. Service numbers 256 and higher relate
to simpler service tasks. Note that AMD reserves service numbers 0-127 and 256-383
for HIF use. Users are free to extend operating system services using the unreserved
service numbers. Operating systems which implement HIF, OS-boot for example,
do not always directly support services 255 and lower. These HIF services are often
translated into native operating system calls, which in effect virtualize the HIF services. For
example, when a HIF-conforming application program is running on a UNIX-based
system, the HIF services are translated into the underlying UNIX services. OS-boot
supports the more complex services by making use of the MiniMON29K message
system to communicate the service request to a debug support host processor (see
Chapter 7). For this reason, services 255 and lower are not always available. Services
with numbers 256 and higher do not require the support of a remote host processor.
These services are implemented directly by OS-boot. If an underlying operating
system, such as UNIX, is being used, then some of these services may not be available as
they may violate the underlying operating system's security.

Table 2-2. HIF Service Calls

Service          Calling Parameters             Returned Values
Title     gr121  lr2        lr3        lr4      gr96          gr97          gr121
--------  -----  ---------  ---------  -------  ------------  ------------  -------
exit      1      exitcode                       Service does not return
open      17     pathname   mode       pflag    fileno                      errcode
close     18     fileno                         retval                      errcode
read      19     fileno     buffptr    nbytes   count                       errcode
write     20     fileno     buffptr    nbytes   count                       errcode
lseek     21     fileno     offset     orig     where                       errcode
remove    22     pathname                       retval                      errcode
rename    23     oldfile    newfile             retval                      errcode
ioctl     24     fileno     mode                                            errcode
iowait    25     fileno     mode                count                       errcode
iostat    26     fileno                         iostat                      errcode
tmpnam    33     addrptr                        filename                    errcode
time      49                                    secs                        errcode
getenv    65     name                           addrptr                     errcode
gettz     67                                    zonecode      dstcode       errcode
sysalloc  257    nbytes                         addrptr                     errcode
sysfree   258    addrptr    nbytes              retval                      errcode
getpsize  259                                   pagesize                    errcode
getargs   260                                   baseaddr                    errcode
clock     273                                   msecs                       errcode
cycles    274                                   LSBs cycles   MSBs cycles   errcode
setvec    289    trapno     funaddr             trapaddr                    errcode
settrap   290    trapno     trapaddr            trapaddr                    errcode
setim     291    mask       di                  mask                        errcode
query     305    capcode                        hifvers                     errcode
                 capcode                        cpuvers                     errcode
                 capcode                        027vers                     errcode
                 capcode                        clkfreq                     errcode
                 capcode                        memenv                      errcode
signal    321    newsig                         oldsig                      errcode
sigdfl    322    [gr125 points to HIF signal frame]  Service does not return
sigret    323    [gr125 points to HIF signal frame]  Service does not return
sigrep    324    [gr125 points to HIF signal frame]  Service does not return
sigskp    325    [gr125 points to HIF signal frame]  Service does not return
sendsig   326    sig                                                        errcode
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When  application  benchmark  programs  use  HIF  services,  care  should  be  taken. 
If  a  program  requests  a  service  such  as  time  (service  49)  it  will  suffer  the  delays  of 
communicating  the  service  request  to  a  remote  host  if  the  OS-boot  operating  system 
is  used.  This  can  greatly  effect  the  performance  of  a  program,  as  execution  will  be 
delayed  until  the  remote  host  responds  to  the  service  request.  It  is  better  to  use  ser¬ 
vices  such  as  cycles  (service  274)  or  clock  (service  273)  which  are  executed  by  the 
29K  processor  and  do  not  suffer  the  delays  of  remote  host  communication. 
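
The benchmarking advice above can be sketched in C. This is only an illustrative fragment: it assumes that the standard library clock() routine on the target is built on the HIF clock service (273), which should be confirmed for the library in use. The work() routine and time_work() wrapper are hypothetical names introduced for the example.

```c
#include <time.h>

/* Hypothetical workload, used only to give the timer something to measure. */
static unsigned long work(unsigned long n)
{
    unsigned long sum = 0;
    unsigned long i;

    for (i = 0; i < n; i++)
        sum += i * i;
    return sum;
}

/*
 * Time one run of work() with clock() rather than time().
 * Assumption: on a 29K target clock() is serviced locally (HIF service 273),
 * so the measurement is not distorted by remote host communication.
 * The elapsed time is returned in clock ticks; the result of work() is
 * returned through *result so the call cannot be optimized away.
 */
static clock_t time_work(unsigned long n, unsigned long *result)
{
    clock_t start = clock();

    *result = work(n);
    return clock() - start;
}
```

Dividing the returned tick count by CLOCKS_PER_SEC gives seconds. Any calls to time(), or other services numbered 255 and lower, are best kept outside the timed region.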

The assembly level glue routines for HIF services 255 and lower are rarely
requested directly by an application program. They are more frequently called upon by
library routines. For example, use of the library printf() routine is the typical way of
generating a write HIF service request. The mapping between library routines and
HIF services may not always be direct. The printf() routine, when used with a device
operating in COOKED mode, may only request write services when flushing buffers
supporting device communication. Appendix A contains a detailed description of
each HIF service in terms of input and output parameters, as well as error codes.


2.2.4  Switching  to  Supervisor  Mode 

Operating systems which conform to HIF normally run application code in User
mode. However, many real-time applications require access to resources which are
restricted to Supervisor mode. If the HIF settrap service is supported, it is easy to
install a trap handler which causes application code to commence execution in
Supervisor mode. The example code sequence below uses the _settrap() HIF library
routine to install a trap handler for trap number 70. The trap is then asserted using the
assembly language glue routine assert_70().


    extern int super_mode();        /* Here in User mode */

    _settrap(70, super_mode);       /* install trap handler */
    assert_70();                    /* routine to assert trap */

    . . .                           /* Here in Supervisor mode */


The trap handler is shown below. Its operation is very simple; it sets the
Supervisor mode bit in the Old Processor Status register (OPS) before issuing a trap return
instruction (IRET). Other application status information is not affected. For
example, if the application was running with address translation turned on, then it will
continue to run with address translation on, but now in Supervisor mode.

In fact the example relies on application code running with physical addressing;
or, if the Memory Management Unit is used to perform address translation, on
virtual addresses being mapped directly to physical addresses. This is because the Freeze
mode handler, super_mode(), runs in Supervisor mode with address translation
turned off. But the settrap system call, which installs the super_mode() handler
address, runs in User mode and thus operates with User mode address values.

            .global _super_mode
    _super_mode:                        ; gr64 is an OS temporary
            mfsr    gr64, ops           ; read the OPS register
            or      gr64, gr64, 0x10    ; set SM bit in OPS
            mtsr    ops, gr64           ; iret back to Supervisor mode
            iret

The super_mode() and assert_70() routines have to be written in assembly
language. The IRET instruction in super_mode() starts execution of the JMPI
instruction in the assert_70() routine shown below. The method shown of forcing a trap can
be used to test a system's interrupt and trap support software.


            .global _assert_70
    _assert_70:                         ; leaf routine
            asneq   70, gr96, gr96      ; force trap 70
            jmpi    lr0                 ; return
            nop


Table 2-3. HIF Service Call Parameters

Parameter   Description

027vers     The version number of the installed Am29027 arithmetic accelerator chip
            (if any).

addrptr     A pointer to an allocated memory area, a command-line-argument array, a
            pathname buffer, or a NULL-terminated environment variable name string.

baseaddr    The base address of the command-line-argument vector returned by the
            getargs service.

buffptr     A pointer to the buffer area where data is to be read from or written to
            during the execution of I/O services, or the buffer area referenced by the
            iowait service.

capcode     The capabilities request code passed to the query service. Code values are:
            0 (request HIF version), 1 (request CPU version), 2 (request Am29027
            arithmetic accelerator version), 3 (request CPU clock frequency), and
            4 (request memory environment).

clkfreq     The CPU clock frequency (in Hertz) returned by the query service.

count       The number of bytes actually read from a file or written to a file.

cpuvers     The CPU family and version number returned by the query service.

cycles      The number of processor cycles (returned value).

di          The disable interrupts parameter to the setim service.

dstcode     The daylight savings time in effect flag returned by the gettz service.

errcode     The error code returned by the service. These are usually the same as the
            codes returned in the UNIX errno variable.

exitcode    The exit code of the application program.

(continued)

Table 2-4. HIF Service Call Parameters (Concluded)

Parameter   Description

filename    A pointer to a NULL-terminated ASCII string that contains the directory path
            of a temporary filename.

fileno      The file descriptor, which is a small integer number. File descriptors 0, 1,
            and 2 are guaranteed to exist and correspond to open files on program entry
            (0 refers to the UNIX equivalent of stdin and is opened for input; 1 refers
            to the UNIX stdout, and is opened for output; 2 refers to the UNIX stderr,
            and is opened for output).

funaddr     A pointer to the address of a spill or fill handler passed to the setvec
            service.

hifvers     The version of the current HIF implementation returned by the query service.

iostat      The input/output status returned by the iostat service.

mask        The interrupt mask value passed to and returned by the setim service.

memenv      The memory environment returned by the query service.

mode        A series of option flags whose values represent the operation to be
            performed. Used in the open, ioctl, and iowait services to specify the
            operating mode.

msecs       Milliseconds returned by the clock service.

name        A pointer to a NULL-terminated ASCII string that contains an environment
            variable name.

nbytes      The number of data bytes requested to be read from or written to a file, or
            the number of bytes to allocate or deallocate from the heap.

newfile     A pointer to a NULL-terminated ASCII string that contains the directory path
            of a new filename.

newsig      The address of the new user signal handler passed to the signal service.

offset      The number of bytes from a specified position (orig) in a file, passed to
            the lseek service.

oldfile     A pointer to a NULL-terminated ASCII string that contains the directory path
            of the old filename.

oldsig      The address of the previous user signal handler returned by the signal
            service.

orig        A value of 0, 1, or 2 that refers to the beginning, the current position, or
            the position of the end of a file.

pagesize    The memory page size in bytes returned by the getpsize service.

pathname    A pointer to a NULL-terminated ASCII string that contains the directory path
            of a filename.

pflag       The UNIX file access permission codes passed to the open service.

retval      The return value that indicates success or failure.

secs        The seconds count returned by the time service.

sig         A signal number passed to the sendsig service.

trapaddr    The trap address returned by the setvec and settrap services. A trap address
            passed to and returned by the settrap service.

trapno      The trap number passed to the setvec and settrap services.

where       The current position in a specified file returned by the lseek service.

zonecode    The time zone minutes correction value returned by the gettz service.

2.3  C  LANGUAGE  COMPILER


I know of six C language compilers producing code for the 29K family. The
most widely used of these are the High C 29K compiler, developed by Metaware Inc.,
and the GNU compiler, supported by the Free Software Foundation and Cygnus Support Inc.
Developers of 29K software normally operate in a cross development environment,
editing and compiling code on one machine which is intended to run on 29K target
hardware. The High C 29K compiler is sold by a number of companies, including
AMD, and packaged along with other vendor tools. High C 29K can produce code
for both big- and little-endian 29K operation. The GNU compiler, gcc, currently
(version 2.5) produces only big-endian code. This does not present a problem as the 29K
is used predominantly in big-endian mode.

2.3.1  Compiler  Optimizations 

A RISC chip is very sensitive to code optimization. This is not surprising since
the RISC philosophy gives software greater access to a processor's internals relative
to most CISC processors. Compilers make use of a number of code optimization
techniques which are difficult for the assembly language programmer to apply
consistently. Some of these techniques are briefly described below.

Common  Sub-Expression  Elimination 


c=a+b; 

d=a+b;  /*  sub-expression  used  again  */ 

The expression a+b is a common sub-expression; it does not need to be
evaluated twice. A more efficient compiler would store the result of the first evaluation
in a local or global register and reuse the value in the second expression. Temporary
variables used during interim calculations are optimized by the compiler. These
compiler-generated temporaries are allocated to register cache locations.

Strength  Reduction 

Whenever possible, "strength reduction" is performed. This refers to replacing
expensive instructions with less expensive ones. For example, multiplies by
powers of two are replaced with more efficient shift instructions.
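
As a small illustration of strength reduction, the two routines below compute the same value. The routine names are hypothetical and chosen only for this sketch; a compiler applying the optimization would be expected to generate code for the first form that resembles the second.

```c
/* As written by the programmer: a multiply by a power of two. */
unsigned int scale_mul(unsigned int x)
{
    return x * 8u;
}

/* Equivalent form after strength reduction: 8 is 1 << 3, so shift left by 3. */
unsigned int scale_shift(unsigned int x)
{
    return x << 3;
}
```

Unsigned arithmetic is used so that both forms are well defined for every input value.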

Loop  Invariant  Code  Motion 

Sometimes a C programmer will place code in a loop which could have been
located outside of the loop. For example, variable initialization need not be repeatedly
executed in a loop. The loop invariant initialization would be located before the
loop code. Hence, the amount of code required to support each loop iteration is
minimized.
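
The following sketch shows the transformation by hand. The routine names and parameters are hypothetical; the second routine is what a compiler performing loop invariant code motion would effectively produce from the first.

```c
/* As written: scale * offset is recomputed on every iteration. */
void fill_naive(int *dst, int n, int scale, int offset)
{
    int i;

    for (i = 0; i < n; i++) {
        int k = scale * offset;     /* loop invariant */
        dst[i] = i + k;
    }
}

/* After code motion: the invariant is computed once, before the loop. */
void fill_hoisted(int *dst, int n, int scale, int offset)
{
    int i;
    int k = scale * offset;

    for (i = 0; i < n; i++)
        dst[i] = i + k;
}
```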

Loop  Unrolling 

There are a number of optimization techniques applicable to code loops. The
objective is always the same: to replace the loop with a sequence of faster executing code.
This often involves unrolling the loop partially or completely. For example, the
compiler may determine a loop is traversed, say, three times. It may be more effective
to replace the loop with three in-line versions of the loop body. This would eliminate the
branching required by the loop. Additionally, when a loop is unrolled there are
generally increased opportunities to apply optimizations not available to the looped
alternative. Consequently, sections of the expanded loop need not be just
duplications of a single loop iteration, but something smaller and more register
efficient.
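
A loop traversed exactly three times, as in the case described above, might be unrolled as in the following sketch. The routine names are hypothetical and used only for illustration.

```c
/* As written: a loop known to execute exactly three times. */
int sum3_loop(const int v[3])
{
    int i;
    int s = 0;

    for (i = 0; i < 3; i++)
        s += v[i];
    return s;
}

/* Fully unrolled: no loop counter, no compare, no branch. */
int sum3_unrolled(const int v[3])
{
    return v[0] + v[1] + v[2];
}
```

The unrolled form also shows the follow-on benefit: the accumulator initialization and the loop counter have disappeared entirely, rather than merely being repeated three times.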

Dead-Code  Elimination 

Code which can never be executed is eliminated. This saves on memory usage.
Unexecutable code can result from a branch which can never be taken. Compilers
generally issue a warning when they detect "unreachable code". Additionally, result
values which are never used can be eliminated; this can remove unneeded store
instructions.

Improved  Register  Allocation 

A  processor’s  registers  are  a  critical  resource  in  determining  performance. 
Accessing  registers  is  very  much  more  efficient  than  accessing  off-chip  memory. 
The  ability  of  the  compiler  to  devise  schemes  to  keep  data  within  the  available 
registers  is  critical.  Additionally,  given  that  the  29K  compiler  determines  the  size  of  a 
procedure’s  register  window,  it  is  important  to  minimize  register  allocation  if  spilling 
and  filling  are  to  be  avoided. 

Constant  Propagation  And  Folding 

Variables are often assigned constant values. Later, the variable is used in a
calculation. The 29K instruction format supports 8-bit immediate data constants.
Applying constant variables as immediate data rather than holding the variable in a
register can be more efficient. Additionally, propagating an immediate value may enable
it to be combined with another immediate value at compile time. This is better than
performing a run-time calculation.
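
The effect of constant propagation followed by folding can be sketched as below. The routine names are hypothetical; the second routine shows the result a compiler could reach from the first without any run-time arithmetic.

```c
/* As written: two variables are assigned constants and then used. */
int area_source(void)
{
    int width = 12;
    int height = 10;

    return width * height + 8;
}

/*
 * After propagation and folding: 12 * 10 + 8 is evaluated at compile time.
 * The result, 128, is small enough to fit an 8-bit immediate field.
 */
int area_folded(void)
{
    return 128;
}
```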

Register-to-Register  Copying  (Copy  Propagation) 

When examining compiler generated code, particularly if the target is a CISC
processor, it is not unusual to see stores of register data to memory locations. This
makes the register available for reuse. Later, the stored data is reloaded for further
processing. The better RISC compilers try to keep data in registers longer, and use
register-to-register copying rather than register-to-memory.

Memory  Shadowing 

The performance impact of a memory access is reduced when the access is
performed to a copy-back data cache. However, most processors do not have this
advantage available to them. The term "memory shadowing" refers to the increased use of
registers for data variable storage. Again, directing accesses to registers rather than
off-chip memory has significant performance advantages. Of course, if a variable is
defined volatile it cannot be held in a register.

Memory  References  Are  Coalesced  and  Aligned 

Data memory can be most efficiently accessed using burst-mode addressing.
This requires the use of load- and store-multiple instructions. When a sufficiently
large data object is being moved between memory and registers, it is best to use the
burst-mode supported instructions. The compiler can also arrange for frequently
accessed data to be located (coalesced) in adjacent memory locations, even if the data
variables were not consecutively defined.

There  are  also  performance  benefits  to  be  had  by  aligning  target  instructions  on 
cache  block  boundaries.  For  example,  a  procedure  can  be  aligned  to  start  on  a  4-word 
boundary.  This  improves  cache  utilization  and  performance  —  particularly  with 
caches  which  do  not  support  partially  filled  cache  blocks. 

Delay  Slot  Filling 

The compilers perform "delay slot filling" (see section 3.1.8). Delay slots occur
whenever a 29K processor experiences a disruption in consecutive instruction
execution. The processor always executes the instruction in the decode pipeline stage, even
if the execute stage contains a jump instruction. Delay slot is the term given to the
instruction following the jump or conditional branch instruction. Effectively, the
branch instruction is delayed one cycle. Unlike assembly language programmers, the
compiler easily finds useful instructions to insert after branching instructions. These
instructions, which are executed regardless of the branch condition, are effectively
achieved at no cost. Typically, an instruction that is invariant to the branch outcome is
moved into the delay slot just after the branch or jump instruction.

Jump  Optimizations 

Because of the pipeline stalling effects of jump instructions, scheduling these
instructions can achieve significant performance improvements. The objective is to
reduce the number of taken branches. For example, code loops typically have
conditional tests at the top of the loop to test for loop completion. This results in branch
instructions at the top and the bottom of the loop. If the conditional branch is moved
to the bottom of the loop then the number of branches is reduced.
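
Moving the loop test to the bottom can be expressed at the C level as follows. This is a sketch with hypothetical routine names; a compiler would normally perform the transformation on the generated branch instructions rather than on the source.

```c
/* Test at the top: each iteration needs the conditional branch at the top
   plus an unconditional jump from the bottom back to the test. */
int sum_top_test(int n)
{
    int i = 0;
    int sum = 0;

    while (i < n) {
        sum += i;
        i++;
    }
    return sum;
}

/* Test at the bottom: one guard test on entry, then a single conditional
   branch per iteration. */
int sum_bottom_test(int n)
{
    int i = 0;
    int sum = 0;

    if (i < n) {
        do {
            sum += i;
            i++;
        } while (i < n);
    }
    return sum;
}
```

The guard test on entry preserves the behaviour of the original loop when it should execute zero times.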
Instruction Scheduling

The 29K allows load and store instructions to be overlapped with other
instructions that do not depend on the load or store data. Ordinarily, a processor will load
data into a register before it makes use of it in the subsequent instruction. To enable
overlapping of the external memory access, the load instruction must be executed at
an earlier stage, before it is required. Best results are obtained if code motion
techniques are used to push the load instruction back by as many instructions as there are
memory access delay cycles (another name for this technique is instruction
prescheduling). This will prevent processor pipeline stalling caused by an operand value
not being available. Once again, code motion is best left to the compiler to worry
about.

Leaf  Procedure  Optimization 

Leaf procedures are procedures which do not call other procedures; at least they
do not contain any C level procedure calls. However, they can contain transparent
routine calls inserted by the compiler. Because of this unique characteristic of leaf
routines, a number of optimizations can be applied. For example, the procedure
prologue and epilogue can be simplified, and alternative register usage can be
adopted. When a leaf is static in
scope (only known within the defining module) alternative parameter passing and
register allocation schemes can be applied.

With  newer  versions  of  the  High  C  29K  compiler,  it  is  possible  to  construct 
simple  procedures  as  transparent  routines  (see  section  3.7).  If  a  procedure  qualifies 
for  a  transparent-type  implementation,  then  its  parent  (in  the  calling  sequence)  may 
itself  become  a  leaf  procedure.  This  propagates  the  benefits  obtained  by  leaf 
procedures. 

In-lining  Simple  Functions 

The  program  may  call  a  procedure  but  the  compiler  can  replace  the  call  with 
equivalent  in-line  code.  For  very  small  procedures  this  can  be  a  performance 
advantage.  However,  as  the  called  procedure  grows  in  size  and  in-lining  is  frequently 
applied,  then  code  space  requirements  will  increase.  In-lining  is  frequently  utilized 
with  C++  code  which  often  has  classes  with  small  member  functions.  The  register 
requirements  of  a  procedure  can  grow  when  it  has  to  deal  with  in-line  code  rather 
than  a  procedure  call.  This  does  not  present  much  difficulty  for  a  29K  processor  as  it 
can  individually  tailor  the  register  allocation  to  each  procedure’s  requirements  with 
dynamically  sized  register  windows. 

As  stated  above,  it  is  possible  to  construct  simple  functions  as  transparent 
routines  (see  section  3.7).  This  is  not  really  in-lining,  but  it  does  further  reduce  the 
overhead  associated  with  even  a  leaf  procedure.  Additionally,  placing  code  in  a 
transparent  routine,  which  is  shared,  helps  reduce  the  code  expansion  which  occurs 
with in-lining. For this reason using the C language keyword _Transparent to define
the type of small procedures may be a performance advantage when used with C++
object member functions.

Global  Function  In-lining 

When code in-lining is applied, it is typically limited to functions defined and
used within a single module. More elaborate schemes enable a function to be
defined in one module and the related code to be inserted in-line even if the call to the
function appears in another file. Applying function in-lining in this global fashion
can greatly extend the benefits of in-lining.

Two-pass  Code  Compilation 

Most compilers apply their optimizations statically, that is, entirely at compile
time. However, by observing the program in execution, optimizations can be further
refined. For example, branch prediction can be applied statically, but observing the
frequency of actual branching reveals the most traversed code paths. Additionally,
the data which is most frequently accessed can be determined. With this information
a second pass of the compiler can be applied and further code optimizations
incorporated.

Superblock  Formation 

Software optimizations are normally only applied within a code block. A block
is a code sequence which is bounded by a single entry point (at the top, a lower
address) and one or more exit points (a jump or call instruction). Instruction
scheduling and other optimizations can be better utilized if an instruction block is
large. For this reason techniques which enlarge a block's size and create a superblock
are important.

A  superblock  may  contain  a  number  of  basic  blocks,  yet  code  optimizations  can 
be  applied  over  the  larger  superblock  code  sequence.  Creation  of  a  superblock  can 
require duplication of code. Typically the tail of a superblock will be duplicated (tail
duplication) to eliminate side entry points to the superblock. Optimization techniques
which  help  superblock  creation  are:  loop  unrolling,  function  in-lining,  jump 
elimination,  code  duplication,  code  migration,  and  code  profiling. 

2.3.2  Metaware  High  C  29K  Compiler 

The Metaware Inc. compiler, invoked with the hc29 driver, has held the position
as the top performing 29K compiler for a number of years. It generally produces the
fastest code, which is also of the smallest size. It is available on SUN and HP workstation
platforms as well as IBM PC-AT machines. It may be made available on other
platforms depending on customer demand. A number of companies resell the compiler
along with other tools, such as debuggers and emulators.

The compiler typically allocates about 12 registers for use by each new
procedure. However, a very large procedure could be allocated up to 128 registers.
This requires the register-stack cache be assigned the maximum window size of 128
registers. The "lregs=n" compiler switch (minimum n=36) enables the maximum
number of registers allocated to a procedure to be limited to less than 128. If the
"lregs" switch is used, it is possible to operate with a reduced window size. This
would increase the frequency of stack spilling and filling (and hence reduce effective
execution speeds) but would enable a faster task context switch time (see section
8.1.4). The maximum number of local registers which would require saving or
restoring would be limited to the reduced window size (window size = rfb - rab).

A number of the example code sequences shown in this book, and provided by
AMD, are configured to operate with a fixed window size of 512 bytes; in particular,
repair_R_stack in file signals and signal_associate in file sig_code.s. These files
should be modified to reflect the reduced window size. Ideally a Supervisor mode
accessible memory location, say WindowSize, should be initialized by the operating
system to the chosen window size, and all subsequent code should access
WindowSize to determine the window size in use. Additionally, the spill handler
routine must be replaced with the code shown below. The replacement handler
requires three additional instructions. But, unlike the more frequently used spill
handler (section 4.4.4), it is not restricted to operating with a fixed window size of
512 bytes.


    spill_handler:
            sub     tav, rab, rsp       ; calculate size of spill
            srl     gr96, tav, 0x2      ; number of words
            sub     gr96, gr96, 0x1
            mtsr    cr, gr96
            sub     tav, rfb, tav       ; determine new rfb position
            const   gr96, 0x200
            or      gr96, tav, gr96     ; point into register file
            mtsr    ipa, gr96
            add     rab, rsp, 0x0       ; adjust rab position
            storem  0, 0x0, gr0, tav    ; move data
            jmpi    tpc
             add    rfb, tav, 0x0       ; adjust rfb position



The  above  spill  handler  code  may  fail  if  there  is  a  procedure  which  does  not  use 
the  gr96  register.  The  compiler  may  hold  a  value  in  gr96  and  expect  it  to  survive  the 
function  call;  and  the  function  call  may  result  in  spill  handler  execution.  This  is  not 
likely,  but  the  use  of  gr96  above  must  be  done  with  care. 

A number of non-standard C features have been added to the compiler. These
features are often useful, but their use reduces the portability of code between
different C compilers. For example, the High C 29K compiler does not normally pack
data structures. The type modifier _Packed can be used to specify packing on a
per-structure basis. If structure packing is selected on the compiler command line,
unpacked structures can be selectively specified with the _Unpacked type modifier.
For example:

    typedef _Packed struct packet_str   /* packed structure */
    {   char A;
        int  B;
    } packet_t;

A HIF conforming operating system provides unaligned memory access trap
handlers (any 29K operating system may choose to do this). Hence, if an object
larger than a byte is accessed and the object is not aligned to an object-sized boundary,
then a trap will be taken and the trap handler will perform the required access in
stages if necessary. The trap handler will require several processor cycles to perform
its task. To the programmer, the required data is accessed as if it were aligned on the
correct address boundary. In the example above, structure member B is of size int but
is not aligned on an int-sized boundary (given object A is a char and it is aligned on a
word-sized boundary).

Of course there is a performance penalty for use of trap handlers. For this reason,
packed data structures are seldom used. However, their use does reduce data
memory requirements, and for this reason data is often sent between processors in
packed data packets. When a data packet is received, its contents can be accessed as
bytes without any data alignment difficulties. Access of data larger than bytes may
require unaligned trap handler support, and thus suffer a performance penalty.

The High C 29K compiler offers a solution to the performance problem with the
type modifiers _ASSUME_ALIGNED and _ASSUME_UNALIGNED. They enable a
pointer to an unaligned structure to be declared. For example:

    receive_packet(packet_p)
    _ASSUME_UNALIGNED packet_t* packet_p;
    {
        int data = packet_p->B;     /* unaligned access */

The  receive_packet()  procedure  is  passed  a  pointer  to  a  data  structure  which  is 
known  to  be  unaligned.  Normally,  when  member  B  of  the  packet  structure  is 
accessed,  an  unaligned  trap  occurs.  However,  informing  the  compiler  of  the 
unaligned  nature  of  the  data  enables  the  compiler  to  replace  the  normal  load 
instruction  used  to  read  the  B  data  with  a  transparent  helper  routine  call  (see  section 
3.7).  The  transparent  helper  routine  performs  the  same  task  as  the  trap  handler  but 
with  a  reduced  overhead. 
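
The idea behind such a helper can be sketched in portable C. The routine below is not the High C 29K helper routine; it is a hypothetical stand-in which assumes big-endian byte order and the packed layout of packet_t described above (char A at byte offset 0, the four bytes of int B starting at offset 1). It builds the value from individual byte loads, which never require alignment.

```c
/*
 * Hypothetical helper: read a 32-bit big-endian value from an address
 * with no alignment guarantee, using only byte accesses.
 */
unsigned long load_u32_unaligned(const unsigned char *p)
{
    return ((unsigned long)p[0] << 24) |
           ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] <<  8) |
            (unsigned long)p[3];
}
```

For a received packet held in a byte buffer, member B would be read with load_u32_unaligned(buffer + 1). Four byte loads and a few shifts cost more than one aligned word load, but far less than taking a trap.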

2.3.3  Free  Software  Foundation,  GCC 

The GNU compiler, gcc, can be obtained from any existing user who is in a
position, and has the time, to duplicate their copy. Alternatively, the Free Software
Foundation can be contacted. For a small fee, Cygnus Support Inc. will ship you a
copy along with their documentation. The GNU compiler is available in source form,
and currently runs on UNIX type host machines as well as 386 based IBM PCs and
compatibles.

Considering the Stanford University benchmark suite, the gcc compiler (version 2.3)
produces code which is on average 15-20% slower in execution compared
to hc29. The GNU compiler also required considerably more memory to contain the
compiled code. Of course, your application program may experience somewhat
different results.

2.3.4  C++  Compiler  Selection 

Programmers first started developing C++ code for the 29K in 1988; they used
the AT&T preprocessor, cfront, along with the High C 29K compiler. A number of
support utilities were developed at that time to enable the use of cfront: nm29,
munch29, and szal29, which gave the size and alignment of 29K data objects
(required for cross development environments).

Because the GNU tool chain can support C++ code development directly with
the GCC compiler, there is little use being made of the AT&T cfront preprocessor.
Additionally, MRI and Metaware have recently announced upgrades to their products
which now enable C++ code development. (C++ makes extensive use of dynamic
memory resources, see section 2.4.1.)

2.3.5  Executable  Code  and  Source  Correspondence 

The typically high levels of optimization applied by a compiler producing code
for RISC execution can make it difficult to identify the relationship between 29K
instructions and the source level code. When looking at the 29K instructions
produced by the compiler, it is not always easy to identify the assembly instructions
which correspond to each line of C code. Optimizations such as code motion,
sub-expression elimination, loop unrolling, instruction scheduling and more all add
to the difficulty.

Fortunately,  there  is  usually  little  need  to  study  the  resulting  instructions 
produced  after  compilation.  However,  it  can  occasionally  be  worth  studying 
compiler  output  when  trying  to  understand  the  performance  of  critical  code 
segments.  It  is  difficult  to  obtain  a  small  example  of  C  code  which  demonstrates  all 
the  potential  code  optimizations.  The  example  below  is  interesting,  but  illustrates 
only  a  few  of  the  difficulties  of  relating  source  code  to  29K  instructions. 

int strcmp(s1, s2)                              /* file strcmp.c */
char *s1, *s2;
{
    int cnt=0;
    for (cnt = 0; ; cnt++)
    {   if(s1[cnt]!=s2[cnt])
            return -1;
        if(s1[cnt]=='\0' || s2[cnt]=='\0')      /* line 8 */
            if(s1[cnt]=='\0' && s2[cnt]=='\0')
                return 0;
            else
                return -1;                      /* line 12 */
    }
}                                               /* line 14 */

The procedure, strcmp(), is similar to the ANSI library routine of the same
name. It is passed the addresses of two strings. The strings are compared to determine if
they are the same. If they are the same, zero is returned; otherwise -1 is returned. This
is not exactly the same behavior as the ANSI routine.

The procedure is based on a for-loop statement which compares characters in
the two strings until they are found to be different or one of the strings is terminated.
The algorithm used by the C code is not optimal, but this makes the example more
interesting as it challenges the compiler to produce the minimum code sequence. The
Metaware compiler was first used to compile the code with a high level of
optimization selected (-O7). The command line used was “hc29 -S -Hanno -O7
strcmp.c”. The “-S” switch causes the compiler to stop after it has produced 29K
assembly code; no linking with libraries is performed. The “-Hanno” switch
causes the source C code to be embedded in the output assembly code. This helps
identify the assembly code corresponding to each line of C code. The assembly code
produced is shown below. Note that some assembly level comment statements have
been added to help explain the code operation.


        .text
        .word   0x40000             ; Tag: argcnt=2 msize=0
        .global _strcmp
_strcmp:
;4  |   int cnt=0;
;5  |   for(cnt=0; ; cnt++)
        jmp     L2
        const   gr97,0              ;cnt=0
L3:                                 ;top of for-loop
L2:
;6  |   {   if(s1[cnt]!=s2[cnt])
        add     gr96,lr2,gr97
        load    0,1,gr99,gr96       ;load s1[cnt]
        add     gr96,lr3,gr97
        load    0,1,gr98,gr96       ;load s2[cnt]
        cpeq    gr96,gr99,gr98      ;compare characters
        jmpf    gr96,L4             ;jump if different
        cpeq    gr96,gr99,0         ;test if s1[cnt] == '\0'
;8  |       if(s1[cnt]=='\0' || s2[cnt]=='\0')
        jmpt    gr96,L5             ;jump if string end
        cpneq   gr96,gr98,0         ;test s2[cnt]
        jmpt    gr96,L3             ;for-loop if not end
        add     gr97,gr97,1         ;increment cnt
L5:
;9  |           if(s1[cnt]=='\0' && s2[cnt]=='\0')
        cpneq   gr96,gr99,0         ;here is at end of string
        jmpt    gr96,L4             ;jump if s1[] != '\0'
        cpneq   gr96,gr98,0
;10 |               return 0;
        jmpt    gr96,L7             ;jump if s2[] != '\0'
        constn  gr96,-1
        jmpi    lr0                 ;strings match
        const   gr96,0              ;return 0
L4:
        constn  gr96,-1             ;no match
L7:
;12 |               return -1;
        jmpi    lr0
        nop

The body of the for-loop is contained between address labels L3 and L5. The
compiler has filled the delay slot of jump instructions with other useful instructions.
Within the for-loop, LOAD instructions are used to access the characters of each
string. Register gr97 is used to hold the loop-count value, cnt. The count value is
incremented each time round the for-loop. The value in gr97 is added to the base of
each string (lr2 and lr3) to obtain the address of each character required for
comparison. The LOAD instructions have been scheduled to somewhat reduce conflict for
off-chip access and reduce the pipeline stalling effects of LOAD instructions.

Within the body of the loop three tests are applied: one to determine if the
characters at the current position in the string match; the remaining two, to determine if
the termination character has been reached for either of the strings. The assembly
code after label L5 selects the correct return value when the tested characters do not
match or string termination is reached. There is unnecessary use of jump instructions
in the code following label L5 and also in the initial code jumping to label L2. It is
somewhat fortunate that this less optimal code does not appear within the more
frequently executed for-loop body.

The same code was compiled with the GNU compiler using the command “gcc -S
-O4 strcmp.c”. The assembly code produced is shown below; it is quite different
from the Metaware produced code.


        .text
        .align  4
        .global _strcmp
        .word   0x40000
_strcmp:
L2:                                 ;top of for-loop
        load    0,1,gr117,lr2       ;load s1[cnt]
        load    0,1,gr116,lr3       ;load s2[cnt]
        cpneq   gr116,gr117,gr116   ;compare characters
        jmpf    gr116,L5            ;jump if match
        cpneq   gr116,gr117,0       ;test for s1[] end
        jmpi    lr0                 ;no match
        constn  gr96,65535          ;return -1
L5:                                 ;here if s1[cnt]==s2[cnt]
        jmpfi   gr116,lr0           ;return if at string end
        const   gr96,0
        add     lr3,lr3,1           ;next s2[] character
        jmp     L2                  ;for-loop
        add     lr2,lr2,1           ;next s1[] character


All of the code is contained in the body of the for-loop. A for-loop transition
consists of 10 instructions, a decrease of one compared to the Metaware code.
However, LOAD instructions are now placed back-to-back, and loaded data is used
immediately. Additionally, the normal path through the for-loop contains an additional
jump to label L5. This will increase the actual number of cycles required to execute a
single for-loop transition to more than 10 cycles. It is likely the Metaware produced
code will execute in a shorter time.

No register (previously gr97) is used to contain the cnt value. The pointers to the
passed strings, lr2 and lr3, are advanced to point to the next character within the
for-loop. Delay slot instructions are productively filled and there are no unnecessary
jump instructions.

Lines 8 through 12 of the source code are only applied if the tested characters are
found to match (a mismatch has already returned -1 at line 7). Consequently, it is
redundant to test whether either string has reached the termination character; if one
has, they both have. This optimization should have been reflected in the source code.
However, the GNU compiler has identified that it need only test string s1[] for
termination. This results in the elimination of 29K instructions relating to later C code
lines. For example, there is no code relating to the if-statement on line 9. If an attempt
is made to place a breakpoint on source line 9 using the GDB source level debugger,
then no breakpoint will be installed. Other debuggers may give a warning message or
place a breakpoint at the first line before or after the requested source line.
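
The source-level form of this optimization can be sketched as follows. This is an illustrative ANSI C rewrite, not code from the original example: the name str_match is hypothetical, and the return convention (0 for identical strings, -1 otherwise) is the one used by the example procedure, not the ANSI strcmp() convention. Once the two characters are known to be equal, a single test against '\0' is enough.

```c
/* Returns 0 when the two strings are identical, -1 otherwise.
 * A mismatch returns immediately; when the characters are equal,
 * testing one of them for '\0' covers both strings. */
int str_match(const char *s1, const char *s2)
{
    int cnt;

    for (cnt = 0; ; cnt++) {
        if (s1[cnt] != s2[cnt])
            return -1;              /* characters differ */
        if (s1[cnt] == '\0')
            return 0;               /* equal characters: both strings end */
    }
}
```

With the redundant tests removed, every remaining source line is likely to have corresponding instructions, which makes breakpoint placement more predictable.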

Programmers familiar with older generation compilers applied to CISC code
generation will notice the increased complexity in associating 29K instructions with
source C statements, even for the simple example shown. As procedures become
larger and more complex, code association becomes increasingly difficult. The
quality of 29K code produced by the better compilers available makes it very difficult
to consistently (or frequently) produce better code via hand crafting 29K
instructions. Because of the difficulty of understanding the compiler generated code, it is
best to only incorporate hand-built code as separate procedures which comply with
the C language calling convention.


2.3.6  Linking  Compiled  Code 

After application code modules have been compiled or assembled, they must be
linked together to form an executable file. There are three widely used linker tools:
Microtec Research Inc. developed ld29; Information Processing Corp. developed
ld29i; and the GNU tool chain offers gld. Sometimes these tools are repackaged by
vendors and made available under different names. They all operate on AMD COFF
formatted files. However, they each have different command line options and link
command-file formats. A further limitation when mixing the use of these tools is that
ld29 operates with a different library format compared to the others. It uses an MRI
format which is maintained by the lib29 tool. The others use a UNIX System V
format supported by the well known ar librarian tool.

It is best to drive the linker from the compiler command line, rather than
invoking the linker directly. The compiler driver program, gcc or hc29 for example, can
build the necessary link command file and include the necessary libraries. This is the
ideal way to link programs, even if assembly language modules are to be named on
the compiler command line. Note that the default link command files frequently
align text (ALIGN .text=8192) and data sections to 8k (8192) byte boundaries. This
is because the OS-boot operating system (see Chapter 7) normally operates with
address translation turned on. The maximum (for the Am29000 processor) page size of
8k bytes is used to reduce run-time Memory Management Unit support overheads.
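
The effect of aligning a section to an 8k byte boundary can be sketched with the usual round-up calculation. This is a minimal illustration, not linker code; the names MMU_PAGE_SIZE and align_up are hypothetical.

```c
#include <stdint.h>

#define MMU_PAGE_SIZE 8192u     /* 8k byte page, as used in the text */

/* Round addr up to the next multiple of boundary.
 * boundary must be a power of two. */
uint32_t align_up(uint32_t addr, uint32_t boundary)
{
    return (addr + boundary - 1u) & ~(boundary - 1u);
}
```

For example, a section that would otherwise start at address 0x2345 is placed at 0x4000 when aligned with align_up(0x2345, MMU_PAGE_SIZE), so that it begins on a fresh page.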

Different 29K evaluation boards can have different memory maps. AMD
normally supplies the High C 29K linker in a configuration which produces a final code
image linked for a popular evaluation board; many boards share the same memory
map. Additionally, AMD supplies linker command files for currently available
boards, such as the EZ030 and SA29200 boards. The linker command files are
located in the installation/lib directory; each command file ends with the file extension
.cmd. For example, the mentioned boards have command files ez030.cmd and
sa200.cmd, respectively. The linker command files can be specified when the
compiler is invoked. For example, the command “hc29 -o file -cmdez030.cmd file.c”
will cause the final image to be linked using the ez030.cmd command file. Using the
supplied linker command files is a convenient way to ensure a program is correctly
linked for the available memory resources.

The GNU compiler also allows options to be passed to the linker via the
“-Xlinker” flag. For example, the command line “gcc -Xlinker -c -Xlinker
ez030.cmd -o file file.c” will compile and link file.c. The linker will be passed the
option “-c ez030.cmd”. The GNU linker documentation claims the linker can
operate  on  MRI  formatted  command  files.  In  practice,  at  least  for  the  29K,  this  is  not 
the  case.  The  GNU  linker  expects  MRI-MC68000  formatted  command  files,  which 
are  a  little  different  from  MRI-29K  formatted  command  files.  Known  differences  are 
the  use  of  the  character  rather  than  “#”  before  comments,  and  the  key  word 
PUBLIC  must  be  upper  case.  Those  using  the  GNU  tool  chain  generally  prefer  to  use 
the  GNU  linker  command  file  syntax  rather  than  attempt  to  use  the  AMD  supplied 
command files.

When developing software for embedded applications there is always the
problem of what to do with initialized data variables. The problem arises because
variables must be located in RAM, but embedded programs are typically not loaded by an
operating system which prepares the data memory locations with initialized values.
Embedded programs are stored in ROM; this means there is no problem with
program instructions unless a program wishes to modify its own code at run-time.

Embedded system support tools typically provide a means of locating
initialized data in ROM and transferring the ROM contents to RAM locations before
program execution starts. The High C 29K linker, ld29, provides the INITDATA
command for this purpose. Programs must be linked such that all references to writeable
data occur to RAM addresses. The INITDATA command scans a list of sections and
transfers the data variables found into a new .initdat section. The list contains the
names of sections containing initialized data. The linker is then directed to locate the
new .initdat section in ROM. The start address of the new section is marked with
symbol initdat.

Developers are provided with the source to a program called initcopy() which
must be included in the application program. This program accesses the data in ROM
starting at label initdat and transfers the data to RAM locations. The format of the
data located in the .initdat section is understood by the initcopy() routine. This
routine must be run before the application main() program. A user could place a call to
the initialization routine inside crt0.s.
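
The general idea of copying initialized data from ROM to RAM can be sketched as follows. This is not the supplied initcopy() source, and the record layout shown is purely hypothetical; the real format of the .initdat section is defined by the ld29 linker. The sketch assumes each block of data is preceded by a header giving its RAM destination and length, with a zero length ending the list.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record header; NOT the real .initdat layout. */
struct init_record {
    size_t ram_offset;      /* where the data belongs in RAM */
    size_t length;          /* data bytes that follow; 0 ends the list */
};

/* Copy every record from the ROM image to its RAM destination.
 * Returns the total number of data bytes copied. */
size_t copy_init_data(const unsigned char *rom, unsigned char *ram)
{
    size_t copied = 0;

    for (;;) {
        struct init_record rec;

        memcpy(&rec, rom, sizeof rec);  /* header may be unaligned */
        rom += sizeof rec;
        if (rec.length == 0)
            break;                      /* end-of-list marker */
        memcpy(ram + rec.ram_offset, rom, rec.length);
        rom += rec.length;
        copied += rec.length;
    }
    return copied;
}
```

A routine of this kind has to run before main(), since main() and everything it calls may rely on the initialized values.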

Note, because initcopy() must be able to read the appropriate ROM devices,
these devices must be placed in an accessible address space. This is not a problem for
2-bus members of the 29K family, but 3-bus members can have a problem if the
.initdat section is located in a ROM device along with program code. Processors with
3-bus architectures, such as the Am29000, have separately addressed Instruction
and ROM spaces which are used for all instruction accesses. The Am29000
processor has no means of reading these two spaces to access data unless an external bridge
is provided. If program code and initialized data are located in the same ROM device,
the initcopy() program can only be used if an external bridge is provided. This bridge
connects the Am29000 processor data memory bus to the instruction memory bus. If
a 3-bus system does not have a bridge, the romcoff utility can be used to initialize data
memory.

The  romcoff  utility  can  be  used  when  the  ld29  linker  is  not  available  and  the 
INITDATA  linker  command  option  is  not  provided.  Besides  being  able  to  work  with 
3-bus  architectures  which  have  no  bridge,  it  can  be  used  to  process  program  sections 
other  than  just  initialized  data.  Sections  which  ultimately  must  reside  in  RAM  can  be 
initialized  from  code  located  in  ROM. 

Fully linked executables are processed by romcoff to produce a new linkable
COFF file. This new module has a section called RI_text which contains a routine
called RAMInit(). When invoked, this routine prepares the relevant RAM regions by
initializing the processed sections. The new COFF file produced by romcoff
must be relinked with the originally linked modules. Additionally, a call to
RAMInit() must be placed in crt0.s or in the processor boot-up code (cold-start code) if
the linked executable is intended to control the processor during the processor
RESET code sequence.

When romcoff is not used with the “-r” option, it assumes that the ROM
memory is not readable. This results in a RAMInit() function which uses CONST
and CONSTH instructions to produce the data values to be initialized in RAM. This
results in extra ROM memory requirements to contain the very much larger
RAMInit() routine, but ensures that 3-bus architectures which do not incorporate a bridge
can initialize their RAM memory.
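
The way a 32-bit data value is built from instruction immediates can be modelled in C. The sketch below assumes the usual behavior of the two instructions: CONST loads a 16-bit constant into the low half of a register and clears the high half, and CONSTH replaces the high half while preserving the low half. The function names are hypothetical.

```c
#include <stdint.h>

/* 16-bit immediate for the CONST instruction: low half of the value. */
uint16_t const_operand(uint32_t value)
{
    return (uint16_t)(value & 0xFFFFu);
}

/* 16-bit immediate for the CONSTH instruction: high half of the value. */
uint16_t consth_operand(uint32_t value)
{
    return (uint16_t)(value >> 16);
}

/* Model of executing CONST followed by CONSTH on one register. */
uint32_t const_then_consth(uint16_t low, uint16_t high)
{
    uint32_t reg = low;                              /* CONST: high half cleared */

    reg = (reg & 0xFFFFu) | ((uint32_t)high << 16);  /* CONSTH: set high half */
    return reg;
}
```

Each 32-bit data word therefore costs two instruction words to construct, before the value is even stored to RAM, which is consistent with the much larger RAMInit() routine described above.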

2.4  LIBRARY  SUPPORT 

2.4.1  Memory  Allocation 

The HIF specification requires that conforming operating systems maintain a
memory heap. An application program can acquire memory during execution by
using the malloc() library routine. This routine makes use of the underlying sysalloc
HIF service. The malloc() call is passed the number of consecutive memory bytes
required; it returns a pointer to the start of the memory allocated from the heap.

Calls to malloc() should be matched with calls to the library routine free(). This
routine is passed the start address of the previously allocated memory. The free()
routine is supported by the sysfree HIF service, which takes the start address along
with the number of bytes acquired. The HIF specification states “no dynamic memory
allocation structure is implied by this service”. This means sysfree may do nothing;
in fact, this service with OS-boot (version 0.5) simply returns. Continually using
memory without ever releasing it, and thus never making it reusable, will be a serious
problem for some application programs, in particular C++ programs, which
frequently construct and destruct objects in heap memory.

For this reason the library routines which interface to the HIF services perform
their own heap management. The first call to malloc() results in a sysalloc HIF
request for 8k bytes, even if the malloc() call was for only a few bytes. Further malloc()
calls do not result in a sysalloc request until the 8k byte pool is used up. Calls to free()
enable previously allocated memory to be returned to the pool maintained by the
library.
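
The pooling behavior can be sketched as follows. This is not the library's implementation: the names pool_alloc, sysalloc_stub and pool_sysalloc_calls are hypothetical, the stub uses the host malloc() in place of the sysalloc HIF service, and the return of freed blocks to the pool is omitted for brevity.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_SIZE 8192u             /* 8k bytes per request, as in the text */

static unsigned char *pool_next;    /* next free byte in the current pool */
static size_t pool_left;            /* bytes remaining in the current pool */
static unsigned sysalloc_calls;     /* number of requests made to the stub */

/* Stand-in for the sysalloc HIF service (hypothetical; uses host malloc). */
static void *sysalloc_stub(size_t nbytes)
{
    sysalloc_calls++;
    return malloc(nbytes);
}

/* Allocate from the pool; ask the stub for more only when the pool is
 * exhausted. Any remainder of the old pool is simply abandoned here. */
void *pool_alloc(size_t nbytes)
{
    void *p;

    nbytes = (nbytes + 7u) & ~(size_t)7u;   /* keep blocks 8-byte aligned */
    if (nbytes > pool_left) {
        size_t request = nbytes > POOL_SIZE ? nbytes : POOL_SIZE;

        pool_next = sysalloc_stub(request);
        if (pool_next == NULL) {
            pool_left = 0;
            return NULL;
        }
        pool_left = request;
    }
    p = pool_next;
    pool_next += nbytes;
    pool_left -= nbytes;
    return p;
}

unsigned pool_sysalloc_calls(void)
{
    return sysalloc_calls;
}
```

Two small requests such as pool_alloc(10) and pool_alloc(100) are served from a single 8k request to the stub; a second request is made only when the remaining pool cannot satisfy an allocation.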

The alloca() library routine provides a means of acquiring memory from the
memory stack rather than the heap. A pointer to the memory region within the calling
procedure’s memory stack frame is returned by alloca(). The advantage of this
method is that there is no need to call a corresponding free routine. The temporary
memory space is automatically freed when the calling procedure returns. Users of the
alloca() service must be careful to remember the limited lifetime of data objects
maintained on the memory stack. After returning from the procedure calling alloca(),
all related data variables cease to exist and should not be referenced.


2.4.2  Setjmp  and  Longjmp 

The setjmp() and longjmp() library routines provide a means to jump from the
current procedure environment to a previous procedure environment. The setjmp()
routine is used to mark the position which a longjmp() will return to. A call to
setjmp() is made by a procedure, passing it a pointer to an environment buffer, as
shown below:


int  setjmp (env) 
jmp_buf  env; 


The buffer definition is shown below. It records the value of register stack and
memory stack support registers in use at the time of the setjmp() call. The setjmp()
call returns a value of zero.

typedef struct jmp_buf_str
{   int* gr1;
    int* msp;
    int* lr0;
    int* lr1;
} *jmp_buf;


The setjmp() routine is very simple. It is listed below to assist with the
understanding of the longjmp() routine. It is important to be aware that setjmp(),
longjmp(), SPILL and FILL handlers, along with the signal trampoline code (see
section 2.5.3) form a matched set of routines. Their operation is interdependent. Any
change to one may require changes to the others to ensure proper system operation.


setjmp:                             ;lr2 points to buffer
        store   0,0,gr1,lr2         ;copy gr1 to buffer
        add     lr2,lr2,4
        store   0,0,msp,lr2         ;copy msp
        add     lr2,lr2,4
        store   0,0,lr0,lr2         ;copy lr0
        add     lr2,lr2,4
        store   0,0,lr1,lr2         ;copy lr1
        jmpi    lr0                 ;return
        const   gr96,0


When longjmp() is called it is passed a pointer to an environment buffer which
was initialized with a previous setjmp() call. The longjmp() call does not return
directly. It does return, but as if it were the corresponding setjmp() call which
established the buffer data. The longjmp() return-as-setjmp() can be distinguished from
a setjmp() return as itself, because the longjmp() appears as a setjmp() return with a
non-zero value. In fact the value parameter passed to longjmp() becomes the setjmp()
return value. A C language outline for the longjmp() routine is shown below:

void longjmp(env, value)
jmp_buf env;
int value;
{
    gr1 = env->gr1;
    lr2addr = env->gr1 + 8;
    msp = env->msp;

    /* saved lr1 is invalid if saved lr2address > rfb */
    if (lr2addr > rfb) {
        /*
         * None of the registers are useful.
         * Set rfb to lr2address-512 & rab to rfb-512
         * the FILL assert will take care of filling
         */
        lr1 = env->lr1;
        rab = lr2addr - WindowSize;
        rfb = lr2addr;
    }
    lr0 = env->lr0;
    if (rfb < lr1)
        raise V_FILL;
    return value;
}
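
From the application's point of view, the return behavior described above can be demonstrated with the standard <setjmp.h> interface. This is a minimal sketch using ANSI C only; the names recovery_point, unwind_with and setjmp_result_after are hypothetical. Note that in ANSI C a longjmp() with a value of zero makes setjmp() return 1.

```c
#include <setjmp.h>

static jmp_buf recovery_point;

/* Abandon the current call chain; control resumes at the setjmp() call. */
static void unwind_with(int value)
{
    longjmp(recovery_point, value);
}

/* Reports what setjmp() returned after unwind_with(value) was called. */
int setjmp_result_after(int value)
{
    switch (setjmp(recovery_point)) {
    case 0:                     /* direct return from setjmp() itself */
        unwind_with(value);
        return -1;              /* not reached */
    case 1:
        return 1;               /* returned via longjmp(env, 1) or (env, 0) */
    case 2:
        return 2;               /* returned via longjmp(env, 2) */
    default:
        return 99;              /* any other non-zero value */
    }
}
```

The switch form is used because ANSI C only permits setjmp() in a limited set of contexts, such as the controlling expression of a selection statement.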


The actual longjmp() routine code, shown below, is written in assembly
language. This is because the sequence of modifying the register stack support registers
is very important. An interrupt could occur during the longjmp() operation. That
interrupt may require a C language interrupt handler to run. The signal trampoline code
is required to understand all the possible register stack conditions, and fix up the
stack support registers to enable further C procedure calls to be made.


longjmp:
        load    0,0,tav,lr2         ;gr1 = env->gr1
        add     gr97,lr2,4          ;gr97 now points to msp
        cpeq    gr96,lr3,0          ;test return "value", it must
        srl     gr96,gr96,31        ; be non zero
        or      gr96,lr3,gr96       ;gr96 has return value
        add     gr1,tav,0           ;gr1 = env->gr1;
        add     tav,tav,8           ;lr2address = env->gr1+8
        load    0,0,msp,gr97        ;msp = env->msp
        cpleu   gr99,tav,rfb        ;if (lr2address > rfb)
        jmpt    gr99,$1
        add     gr97,gr97,4         ;gr97 points to lr0
        add     gr98,gr97,4         ;gr98 points to lr1
        load    0,0,lr1,gr98        ;lr1 = value from jmpbuf
        sub     gr99,rfb,rab        ;gr99 has WindowSize
        sub     rab,tav,gr99        ;rab = lr2address-WindowSize
        add     rfb,tav,0           ;rfb = lr2address
$1:
        load    0,0,lr0,gr97        ;lr0 = env->lr0
        jmpi    lr0                 ;return
        asgeu   V_FILL,rfb,lr1      ;if (rfb < lr1) raise V_FILL
                                    ; may fill from rfb to lr1
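
The cpeq, srl and or instructions near the start of the routine ensure that the returned value is never zero. Their combined effect can be modelled in C as shown below. This is an illustrative sketch; the names TRUE_29K and longjmp_return_value are hypothetical, and it relies on the 29K convention that compare instructions produce a Boolean TRUE with bit 31 set.

```c
#include <stdint.h>

#define TRUE_29K 0x80000000u    /* 29K Boolean TRUE: bit 31 set */

/* Model of the three-instruction sequence: a zero "value" becomes 1,
 * any other value is passed through unchanged. */
int32_t longjmp_return_value(int32_t value)
{
    uint32_t is_zero = (value == 0) ? TRUE_29K : 0u;  /* cpeq gr96,lr3,0    */
    uint32_t one     = is_zero >> 31;                 /* srl  gr96,gr96,31  */

    return (int32_t)((uint32_t)value | one);          /* or   gr96,lr3,gr96 */
}
```

The sequence avoids a conditional jump: the Boolean result is shifted down to become either 0 or 1 and is then merged with the original value.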

2.4.3  Support  Libraries 

The GNU tool chain is supported with a single library, libc.a. However, the High
C 29K tool chain is supported with a range of library options. It is best to use the
compiler driver, hc29, to select the appropriate library. This avoids having to master the
library naming rules and build linker command files.

The GNU libraries do not support word-sized-access-only memory systems.
Originally, the Am29000 processor could not support byte-sized accesses and all
memory accesses were performed on word-sized objects. This required
read-modify-write access sequences to manipulate byte-sized objects located in memory.
Because all current 29K processors support byte-sized access directly, there is no need
to have specialized libraries for accessing bytes. However, the High C 29K tool chain
still ships the old libraries to support existing (pre-Rev D, 1990) Am29000
processors.
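
A read-modify-write sequence for storing a single byte into word-only memory can be sketched as follows. This is an illustration of the technique rather than the code used by the old libraries; the name store_byte_rmw is hypothetical, and big-endian byte numbering within a word (byte 0 is the most significant) is an assumption.

```c
#include <stdint.h>

/* Store one byte into memory that only supports word-sized accesses.
 * byte_addr is a byte offset from the start of the memory array. */
void store_byte_rmw(uint32_t *memory, uint32_t byte_addr, uint8_t value)
{
    uint32_t index = byte_addr >> 2;                /* word holding the byte */
    unsigned shift = (3u - (byte_addr & 3u)) * 8u;  /* bit position in word  */
    uint32_t word  = memory[index];                 /* read                  */

    word &= ~((uint32_t)0xFFu << shift);            /* modify: clear old byte */
    word |= (uint32_t)value << shift;               /*         insert new one */
    memory[index] = word;                           /* write                 */
}
```

Every byte store therefore costs a load, several ALU operations and a store, which is why direct byte-sized access in hardware is preferable.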

The hc29 driver normally links with three libraries: the ANSI standard C
support library (libs*.lib), the IEEE floating-point routine library (libieee.lib), and the
HIF system call interface library (libos.lib). There are actually eight ANSI libraries.
The driver selects the appropriate library depending on the selected switches. So
many libraries are needed because of the support for the old word-only memory
systems, the option to talk with an Am29027 coprocessor directly, and finally, the option
to select Am29050 processor optimized code.

The ANSI library includes transcendental routines (sin(), cos(), etc.) which
were developed by Kulus Inc. These routines are generally faster than the
transcendental routines developed by QTC Inc., which were at one time shipped with High C
29K. The QTC transcendentals are still supplied as the libq*.lib libraries, and must
now be explicitly linked. The Kulus transcendentals also have the advantage that
they support double and single floating-point precision. The routines are named
slightly differently, and the compiler automatically selects the correct routine
depending on parameter type. The GNU libraries (version 2.1) include the QTC
transcendental routines.

Most 29K processors do not support floating-point instructions directly (see
section 3.1.7). When a non-implemented floating-point instruction is encountered,
the processor takes a trap, and operating system routines emulate the operation in
trapware code. If a system has an Am29027 floating-point coprocessor available,
then the trapware can make use of the coprocessor to achieve faster instruction
emulation. This is generally five times faster than software based emulation.
Keeping the presence of the Am29027 coprocessor hidden in operating system support
trapware enables application programs to be easily moved between systems with
and without a coprocessor.

However, an additional (about two times) speed-up can be achieved by
application programs talking to the Am29027 coprocessor directly, rather than via trapware.
When the High C 29K compiler is used with the “-29027” or “-f027” switches, inline
code is produced for floating-point operations which directly access the coprocessor.
Unfortunately the compiled code cannot be run on a system which has no
coprocessor. The ANSI standard C support libraries also support inline Am29027 coprocessor
access with the libs*7.lib library. When instructed to produce direct coprocessor
access code, the compiler also instructs the linker to use this library in place of the
standard library, libs*0.lib.

The Am29050 processor supports integer multiply directly in hardware rather
than via trapware. It also supports integer divide by converting operands to
floating-point before dividing and converting back to integer. The High C 29K compiler
performs integer multiply and divide by using transparent helper routines (see section
3.7); this is faster than the trapware method used by the GNU compiler. When the
High C 29K compiler is used with the “-29050” switch, and the GNU compiler with
the “-m29050” switch, code optimized for the Am29050 processor is
produced. This code may not run on other 29K family members, as the Am29050
processor has some additional instructions (see sections 3.1.6 and 3.1.7).
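
The idea of performing integer division through floating-point can be sketched in portable C. This is only a model of the technique, not the instruction sequence emitted for the Am29050; the name div_via_double is hypothetical. It assumes 32-bit operands and an IEEE double with a 53-bit significand, in which case both operands convert exactly and truncating the quotient gives the same result as integer division. Division by zero and the overflow case INT32_MIN / -1 are not handled.

```c
#include <stdint.h>

/* Integer divide performed in floating-point: convert, divide,
 * then truncate toward zero when converting back to integer. */
int32_t div_via_double(int32_t dividend, int32_t divisor)
{
    double quotient = (double)dividend / (double)divisor;

    return (int32_t)quotient;
}
```

For example, div_via_double(-7, 2) yields -3, matching the truncating behavior of integer division in C.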

2.5  C  LANGUAGE  INTERRUPT  HANDLERS 

Embedded application code developers typically have to deal with interrupts
from peripheral devices requiring attention. As with general code development, there
is a desire to deal with interrupts using C language code rather than assembly
language code. Compared to CISC type processors, which generally do not have a
register stack, this is a little more difficult to achieve with the 29K family. In addition, 29K
processors do not have microcode to automatically save their interrupted context.
The interrupt architecture of a 29K processor is very flexible and is dealt with in
detail in Chapter 4. This section presents two useful techniques enabling C language
code to be used for interrupts supported by a HIF conforming operating system.

The characteristics of the C handler function are important in determining the
steps which must be taken before the handler can execute. It is desirable that the C
handler run in Freeze mode because this will reduce the overhead costs. These costs
are incurred because interrupts may occur at times when the processor is operating in
a condition not suitable for immediately commencing interrupt processing. Most of
these overheads are concerned with register stack support and are described in detail
in section 4.4. This section deals with establishing an interrupt handler which can run
in Freeze mode. Section 2.5.3 deals with all other types of C language
interrupt handlers.

A C language interrupt handler qualifies for Freeze mode execution if it meets
a number of criteria:

■ It is a small leaf routine which does not attempt to lower the register stack
pointer. This means that, should the interrupt have occurred during a critical
stage in register stack management, the stack need not be brought to a valid
condition.

■ Floating-point instructions not directly supported by the processor are not used.
Many members of the 29K family emulate floating-point instructions in
software (see Chapter 3).

■ Instructions which may result in a trap are not used. All interrupts and traps are
disabled while in Freeze mode. This means the Memory Management Unit
cannot be used for memory access protection and address translation.

■ The handler's execution is short. Because the handler is to be run in Freeze mode,
its execution time will add to the system interrupt latency.

■ The handler does not attempt to execute LOADM and STOREM instructions
while in Freeze mode. When a performance gain can be had, the High C 29K
compiler will use these instructions to move blocks of data; this does not
typically happen with short Freeze mode interrupt handlers. However, the High
C 29K compiler supports the _LOADM_STOREM pragma which can be used
to turn off or on (default) the use of LOADM and STOREM instructions.

■ Transparent procedure calls are not used (see section 3.7). They typically
require the support of indirect pointers, which are not temporarily saved by the
code presented in this section.
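
The criteria above can be collected into a simple checklist. The following C
sketch is only an illustration: the structure handler_traits and the function
qualifies_for_freeze_mode() are hypothetical names, and the fields are assumed
to be filled in by hand after inspecting the handler's assembly output. The
"short execution" criterion is a judgment call and is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a compiled handler, filled in by hand after
 * inspecting the assembly output of the compiler. */
struct handler_traits {
    bool is_leaf;                /* makes no procedure calls            */
    bool lowers_register_stack;  /* lowers the register stack pointer   */
    bool uses_emulated_fp;       /* floating-point emulated by trapware */
    bool may_trap;               /* contains instructions that can trap */
    bool uses_loadm_storem;      /* LOADM or STOREM present             */
    bool uses_transparent_calls; /* transparent helper routines called  */
};

/* Returns true when every modeled Freeze mode criterion is met. */
bool qualifies_for_freeze_mode(const struct handler_traits *h)
{
    assert(h != NULL);
    return h->is_leaf
        && !h->lowers_register_stack
        && !h->uses_emulated_fp
        && !h->may_trap
        && !h->uses_loadm_storem
        && !h->uses_transparent_calls;
}
```

A handler that fails any one of these tests is a candidate for the signal
method of section 2.5.3 instead.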

The methods shown in this and the following section rely on application code
running with physical addressing; or, if the Memory Management Unit is used to
perform address translation, then virtual addresses are mapped directly to physical
addresses. This is because the macros used to install the Freeze mode trap handlers are
used to generate code in User mode and thus operate with User mode address values.
However, Freeze mode code runs in Supervisor mode with address translation turned
off.

The Metaware High C 29K and GCC compilers prior to release 3.2 have no C
language extension to aid with interrupt handling. Release 3.2, or newer, supports the
key word _Interrupt as a procedure return type. Use of this C language extension
results in additional tag data (see section 3.6) preceding the interrupt handler routine.
Without the interrupt tag data, the only way to identify if a handler routine qualifies
for the above Freeze mode handler status is to compile it with the “-S” option and
examine the assembly language code. Alternatively, handler routines which make
function calls can be immediately eliminated as unsuitable for operation in Freeze
mode. Examining the assembly language code also enables the nregs value used in
the following code to be determined. Small leaf routines operate with global registers
only. Starting with gr96, nregs is the number of global registers used by a C leaf
handler routine.

The interrupt_handler macro defined below can be used to install a C level
interrupt handler which is called upon when the appropriate trap or interrupt
occurs. The code is written in assembly language because it must use a carefully crafted
instruction sequence, the first part of which uses the HIF settrap service to install, in
the processor vector table, the address ($1) which will be vectored to when the
interrupt occurs. The necessary code is written as a macro rather than a procedure call
because the second part of the macro contains the start of the actual interrupt handler
code. This code, starting at address $1, is unique to each interrupt and cannot be
shared. Note, the code makes use of push and pop macro instructions to transfer data
between registers and the memory stack. These assembly macros are described in
section 3.3.1.


        .reg    it0,gr64                ;freeze mode interrupt
        .reg    it1,gr65                ;temporary registers

;install interrupt handler

        .macro  interrupt_handler, trap_number, C_handler, nregs

        sub     gr1,gr1,4*4             ;get lr0-lr3 space
        asgeu   V_SPILL,gr1,rab         ;check for stack spill
        add     lr1,gr121,0             ;save gr121
        add     lr0,gr96,0              ;save gr96
        const   gr121,290               ;HIF 2.0 SETTRAP service
        const   lr2,trap_number         ;trap number, macro parameter
        const   lr3,$1                  ;trap handler address
        consth  lr3,$1
        asneq   69,gr1,gr1              ;HIF service request
        add     gr121,lr1,0             ;restore gr121
        add     gr96,lr0,0              ;restore gr96
        add     gr1,gr1,4*4             ;restore stack
        jmp     $2                      ;macro code finished
        asleu   V_FILL,lr1,rfb          ;check for stack fill
$1:     push    msp,lr0                 ;start of Interrupt handler
        pushsr  msp,it1,ipa             ;save special reg. ipa
        const   it0,nregs-2             ;number of regs. to save
        const   it1,96<<2               ;starting with gr96
$3:     mtsr    ipa,it1
        add     it1,it1,1<<2            ;increment ipa
        sub     msp,msp,4               ;decrement stack pointer
        jmpfdec it0,$3
        store   0,0,gr0,msp             ;save global registers
        const   lr0,C_handler
        consth  lr0,C_handler
        calli   lr0,lr0                 ;call C level handler
        nop
        const   it0,nregs-2             ;number of global registers
        const   it1,(96+nregs-1)<<2
$4:     mtsr    ipa,it1
        load    0,0,gr0,msp             ;restore global register
        sub     it1,it1,1<<2            ;decrement ipa
        jmpfdec it0,$4
        add     msp,msp,4               ;increment stack pointer
        popsr   ipa,it0,msp             ;restore ipa
        pop     lr0,msp                 ;restore lr0
        iret
$2:
        .endm


Because the C level handler is intended to run in Freeze mode, there is very little
code before the required handler, C_handler, is called. Registers lr0 and IPA are
saved on the memory stack before they are temporarily used. Then the required
number of global registers (nregs), starting with gr96, are also saved on the stack. The
programmer must determine the nregs value by examining the handler routine assembly
code.
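
The loop counter in the macro is loaded with nregs-2, which can look surprising.
The following C sketch models the save loop under the assumption that JMPFDEC
branches when its counter is non-negative and always decrements the counter
afterward. The function names jmpfdec() and simulate_save_loop() are invented
for this illustration; the sketch only counts loop passes and records which
global register each pass would store.

```c
#include <assert.h>
#include <stdint.h>

/* Model of JMPFDEC as assumed here: the branch is taken when the counter
 * is non-negative, and the counter is decremented in either case. */
static int jmpfdec(int32_t *counter)
{
    int taken = (*counter >= 0);
    *counter -= 1;
    return taken;
}

/* Walks the save loop for a given nregs. Each pass stores one global
 * register, beginning with gr96. Returns the number of passes and
 * records the register numbers in saved[] (up to max_saved entries). */
int simulate_save_loop(int nregs, int saved[], int max_saved)
{
    assert(nregs >= 1);
    int32_t it0 = nregs - 2;    /* const it0,nregs-2              */
    int reg = 96;               /* const it1,96<<2 selects gr96   */
    int count = 0;

    do {
        if (count < max_saved)
            saved[count] = reg; /* store 0,0,gr0,msp through ipa  */
        count++;
        reg++;                  /* add it1,it1,1<<2               */
    } while (jmpfdec(&it0));

    return count;
}
```

With nregs set to 8 the model makes eight passes and saves gr96 through gr103;
with nregs set to 1 the counter starts at -1 and the loop body runs once.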

The interrupt_handler macro must be used in an assembly language module.
Alternatively, a C language compiler extension can be used. The High C 29K
compiler supports an extension which enables assembly code to be directly inserted into C
code modules. This enables a C macro to be defined which will call upon the
assembly language macro code. The example code below shows the C macro definition.

#define interrupt_handler(trap_number, C_handler, nregs)  \
        /*int  trap_number;                               \
          void (*C_handler)();                            \
          int  nregs;                                  */ \
        _ASM("  interrupt_handler "#trap_number","#C_handler","#nregs);

Alternatively, the C macro could contain the assembly macro code directly. Using
the technique shown, C modules which use the macro must first be compiled with
the “-S” option; this results in an assembly language output file. The assembly
language file (.s file) is then assembled with an include file which contains the macro
definition. Note, C modules which use the macro must use the _ASM(“assembly-string”)
C extension to include the assembly language macro file (shown below) for
its later use by the assembler. The GCC compiler supports the asm(“assembly-string”)
C extension which achieves the same result as the High C 29K
_ASM(“assembly-string”) extension.

_ASM("  .include \"interrupt_macros.h\"");

/* int2_handler uses 8 regs. and is called
   when hardware trap number 18 occurs */
interrupt_handler(18, int2_handler, 8);

2.5.1  An  Interrupt  Context  Cache  with  High  C  29K


The interrupt_handler macro code, described in the previous section,
prepares the processor to handle a C language interrupt handler which can operate within
the processor Freeze mode restrictions. The code saves the interrupted processor
context onto the current memory stack position before calling the C handler.

The interrupt_cache macro shown below can be used in place of the previously
described macro. Its use is also restricted to preparing the processor to handle a C
level handler which meets the Freeze mode execution criteria. However, its operation
is considerably faster due to the use of an Interrupt Context Cache. Section 4.3.9
describes context caching in more detail. A cache is used here only to save sufficient
context to enable a non-interruptible C level handler to execute.

The cache is implemented using operating system registers gr64-gr80. These
global registers are considered operating system temporaries; at least gr64-gr79 are
(also known as it0-it3 and kt0-kt11). Register gr80 (known as ks0) is generally used
to hold operating system static data (see section 3.3). Processors which do not
directly support floating-point operations contain instruction emulation software
(trapware) which normally uses registers in the gr64-gr79 range to support instruction
emulation. Given that application code can perform a floating-point operation at any
time, an operating system cannot assume that the contents of these registers remain
static after application code has run. For this reason, and because floating-point
trapware normally runs with interrupts turned off, it is convenient to use these
registers for interrupted context caching.

The  interrupt_handler  macro  uses  a  loop  to  preserve  the  global  registers  used 
by  the  Freeze  mode  interrupt  handler.  The  interrupt_cache  macro  unrolls  the  loop 
and  uses  register-to-register  operations  rather  than  register-to-memory.  In  place  of 
traversing  the  loop  nregs  times,  the  nregs  value  is  used  to  determine  the  required 
entry  point  to  the  unrolled  code.  These  techniques  reduce  interrupt  preparation  times 
and  interrupt  latency. 


        .macro  interrupt_cache, trap_number, C_handler, nregs

        sub     gr1,gr1,4*4             ;get lr0-lr3 space
        asgeu   V_SPILL,gr1,rab         ;check for stack spill
        add     lr1,gr121,0             ;save gr121
        add     lr0,gr96,0              ;save gr96
        const   gr121,290               ;HIF 2.0 SETTRAP service
        const   lr2,trap_number         ;trap number, macro parameter
        const   lr3,$1-(nregs*4)        ;trap handler address
        consth  lr3,$1-(nregs*4)
        asneq   69,gr1,gr1              ;HIF service request
        add     gr121,lr1,0             ;restore gr121
        add     gr96,lr0,0              ;restore gr96
        add     gr1,gr1,4*4             ;restore stack
        jmp     $2                      ;macro code finished
        asleu   V_FILL,lr1,rfb          ;check for stack fill

        add     gr80,gr111,0            ;save gr111 to interrupt
        add     gr79,gr110,0            ; context cache
        add     gr78,gr109,0
        add     gr77,gr108,0            ;the interrupt handler starts
        add     gr76,gr107,0            ;somewhere in this code range
        add     gr75,gr106,0            ;depending on the register
        add     gr74,gr105,0            ;usage of the C level code
        add     gr73,gr104,0
        add     gr72,gr103,0
        add     gr71,gr102,0
        add     gr70,gr101,0
        add     gr69,gr100,0
        add     gr68,gr99,0
        add     gr67,gr98,0
        add     gr66,gr97,0             ;save gr97
        add     gr64,lr0,0              ;save lr0
$1:     const   lr0,C_handler
        consth  lr0,C_handler
        calli   lr0,lr0                 ;call C level handler
        add     gr65,gr96,0             ;save gr96
        jmp     $2-4-(nregs*4)          ;determine registers used
        add     lr0,gr64,0              ;restore lr0
        add     gr111,gr80,0            ;restore gr111 from interrupt
        add     gr110,gr79,0            ; context cache
        add     gr109,gr78,0
        add     gr108,gr77,0
        add     gr107,gr76,0
        add     gr106,gr75,0
        add     gr105,gr74,0
        add     gr104,gr73,0
        add     gr103,gr72,0
        add     gr102,gr71,0
        add     gr101,gr70,0
        add     gr100,gr69,0
        add     gr99,gr68,0
        add     gr98,gr67,0
        add     gr97,gr66,0
        add     gr96,gr65,0             ;restore gr96
        iret
$2:
        .endm
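
The address arithmetic in the interrupt_cache macro can be checked with a small
model. The C sketch below assumes 4-byte instructions and assumes that the label
$1 named in the expression $1-(nregs*4) sits immediately after the instruction
which saves lr0. The function names are invented for this illustration, and the
register pairing (gr96 cached in gr65, up to gr111 cached in gr80) is read from
the listing.

```c
#include <assert.h>

enum {
    INSTRUCTION_BYTES = 4,   /* every 29K instruction is one 32-bit word */
    FIRST_GLOBAL      = 96,  /* first register a High C handler may use  */
    LAST_GLOBAL       = 111, /* last register covered by the cache       */
    FIRST_CACHE       = 65   /* gr65 caches gr96 in the listing          */
};

/* Byte offset of the handler entry point relative to label $1,
 * mirroring the macro expression $1-(nregs*4). */
int entry_offset_bytes(int nregs)
{
    assert(nregs >= 1 && nregs <= LAST_GLOBAL - FIRST_GLOBAL + 1);
    return -(nregs * INSTRUCTION_BYTES);
}

/* Cache register which holds global register gr<global_reg>. */
int cache_register_for(int global_reg)
{
    assert(global_reg >= FIRST_GLOBAL && global_reg <= LAST_GLOBAL);
    return global_reg - FIRST_GLOBAL + FIRST_CACHE;
}
```

For a handler using two registers, the entry point lies 8 bytes before $1, so
only the copies of gr97 and lr0 execute before the call, and gr96 is saved in the
delay slot of the call.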


2.5.2  An  Interrupt  Context  Cache  with  GNU 

The previous section presented interrupt context caching when using the
Metaware High C 29K compiler. Global register assignment with the Free Software
Foundation compiler, GCC, is very different from High C 29K. Global registers
gr96-gr111 are little used, except for return values. GCC has very frugal global
register usage. It mainly uses global registers gr116-gr120. This affects the interrupt
preparation code required for Freeze mode C level handlers. High C 29K uses global
registers in the gr96-gr111 range as temporaries before starting to use gr116-gr120.
The reduced use of global registers might make GCC a better choice for building
Freeze mode C-level interrupt handlers.

The assembler, as29, supplied with the GCC compiler chain does not support
macros directly. But it is possible to use the C preprocessor, CPP, to do macro
instruction expansion. The interrupt_cache macro shown below demonstrates the use of
CPP with 29K assembly code. The macro is used to install a C handler for the selected
trap_number. The early part of the macro code requests the HIF settrap service be
used to insert the interrupt handler address into the processor vector table. The actual
address inserted depends on the register usage of the C handler.

The handler must be examined to determine the registers used. Parameter nregs
is used to specify the number of registers used in the gr116-gr120 range. The handler
preparation code saves the necessary global registers in an interrupt context cache
before calling the C code. Global registers gr96-gr111 are not saved in the cache, as it
is likely that they are not used by the handler; it certainly has no return value.

The  context  cache  is  formed  with  global  registers  gr64-gr80.  Registers 
gr64-gr79  are  used  by  floating-point  emulation  routines,  and  hence  their  contents 
are  available  for  use  between  floating-point  trap  instructions.  This  assumes  that  the 
trapware  runs  with  interrupts  turned  off  which  is  normally  the  case.  For  more  details 
see  section  2.5.  Saving  the  registers  used  by  the  handler  in  this  way  is  much  faster 
than  pushing  the  registers  onto  an  off-chip  memory  stack. 


#define interrupt_cache(trap_number, C_handler, nregs)\
;start of interrupt_cache macro, nregs must be >=1              _CR_\
        nop                             ;delay slot protection  _CR_\
        sub     gr1,gr1,4*4             ;get lr0-lr3 space      _CR_\
        asgeu   V_SPILL,gr1,rab         ;check for stack spill  _CR_\
        add     lr1,gr121,0             ;save gr121             _CR_\
        add     lr0,gr96,0              ;save gr96              _CR_\
        const   gr121,290               ;HIF 2.0 SETTRAP service _CR_\
        const   lr2,trap_number         ;trap number, macro parameter _CR_\
        const   lr3,cache_##trap_number-(nregs*4) ;handler adds. _CR_\
        consth  lr3,cache_##trap_number-(nregs*4) ;             _CR_\
        asneq   69,gr1,gr1              ;HIF service request    _CR_\
        add     gr121,lr1,0             ;restore gr121          _CR_\
        add     gr96,lr0,0              ;restore gr96           _CR_\
        add     gr1,gr1,4*4             ;restore stack          _CR_\
        jmp     cache_end_##trap_number ;install code finished  _CR_\
        asleu   V_FILL,lr1,rfb          ;check for stack fill   _CR_\
;START of interrupt handler code                                _CR_\
        add     gr70,gr120,0            ;save gr120             _CR_\
        add     gr69,gr119,0            ;save gr119             _CR_\
        add     gr68,gr118,0            ;save gr118             _CR_\
        add     gr67,gr117,0            ;save gr117             _CR_\
        add     gr64,lr0,0              ;save lr0               _CR_\
cache_##trap_number:                    ;gr96-gr111 not saved in cache _CR_\
        const   lr0,C_handler           ;call C-level handler   _CR_\
        consth  lr0,C_handler           ;                       _CR_\
        calli   lr0,lr0                 ;call C level handler   _CR_\
        add     gr66,gr116,0            ;save gr116             _CR_\
        jmp     cache_end_##trap_number-4-(nregs*4) ;determine registers used _CR_\
        add     lr0,gr64,0              ;restore lr0            _CR_\
        add     gr120,gr70,0            ;restore gr120 from cache _CR_\
        add     gr119,gr69,0            ;                       _CR_\
        add     gr118,gr68,0            ;                       _CR_\
        add     gr117,gr67,0            ;                       _CR_\
        add     gr116,gr66,0            ;                       _CR_\
        iret                            ;                       _CR_\
cache_end_##trap_number:                ;end of interrupt cache macro _CR_



The code example below shows how the macro can be invoked. The routine
install_handlers() is written in assembly code. It includes a macro for a C level
interrupt handler, int2_handler(), assigned to 29K interrupt INTR2. The C level handler
was examined and found to be a qualifying leaf routine using only two global
registers.


        .text
        .extern _int2_handler
        .global _install_handlers
_install_handlers:
        sub     gr1,gr1,2*4             ;prologue not really needed
        asgeu   V_SPILL,gr1,gr126       ;lower stack pointer
        interrupt_cache(18,_int2_handler,2) ;macro example
        add     gr1,gr1,2*4             ;raise stack pointer
        constn  gr96,-1                 ;return TRUE value
        jmpi    lr0                     ;return
        asleu   V_FILL,lr1,rfb          ;procedure epilogue

The C preprocessor is invoked with the app shell script program shown below.
This is a convenient way of directing CPP to process an assembly program source
file. The use of CPP has one problem: macros are expanded into long lines. The
carriage returns in the macro source file do not appear in the expanded code. To reinsert
the carriage returns and make the assembly code lines compatible with assembler
syntax, each assembly line in the macro is marked with the token _CR_. The UNIX
stream editor, sed, is then used to replace the _CR_ with a carriage return.
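
The token replacement itself is simple and can be expressed in C as well. The
sketch below is only an illustration of the substitution step; the function name
expand_cr_tokens() is hypothetical and is not part of the tool chain described
here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies src to dst, replacing every "_CR_" token with a newline.
 * Returns the length of the result, or (size_t)-1 when dst is too
 * small to hold the result and its terminating null character. */
size_t expand_cr_tokens(const char *src, char *dst, size_t dst_size)
{
    static const char token[] = "_CR_";
    const size_t token_len = sizeof token - 1;
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    if (dst_size == 0)
        return (size_t)-1;

    while (*src != '\0') {
        if (out + 1 >= dst_size)        /* keep room for the '\0' */
            return (size_t)-1;
        if (strncmp(src, token, token_len) == 0) {
            dst[out++] = '\n';          /* token becomes a line break */
            src += token_len;
        } else {
            dst[out++] = *src++;        /* ordinary character */
        }
    }
    dst[out] = '\0';
    return out;
}
```

Applied to one long expanded macro line, the function yields one assembly
statement per output line, which is the form the assembler expects.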


#!/bin/sh
#start of app shell script
#example, "app file_in.s"
prams=$*
tmp=/tmp/expand.$$
cpp -P $prams > $tmp            #invoke CPP
sed 's/_CR_/\
/g' $tmp
rm $tmp

2.5.3  Using  Signals  to  Deal  with  Interrupts

Some C language interrupt handlers will not be able to run in Freeze mode
because (as described in section 2.5) they are unsuitable leaf routines, or are not leaf
routines and thus require use of the register stack. In this case the signal trampoline
code described in section 4.4 and Appendix B must be used. The trampoline code is
called by the Freeze mode interrupt handler after critical registers have been saved on
the memory stack. The C language handler is called by the trampoline code after the
register stack is prepared for further use. Note that interrupts can occur at times when
the register stack condition is not immediately usable by a C language handler.

The signal mechanism works by registering a signal handler function address
for use when a particular signal number occurs. This is done with the library routine
signal(). Signals are normally generated by abnormal events, and the signal() routine
allows the operating system to call a user supplied routine to deal
with the event. The signal() function uses the signal HIF service to supply the address
of a library routine (sigcode) which will be called for all signals generated. (Note,
only the signal, settrap and sigret-type subset of HIF services are required.) The
library routine is then responsible for calling the appropriate C handler from a table of
handlers indexed by the signal number. When signal() is used, a table entry is
constructed for the indicated signal.


signal (sig_number,  func) 
int  sig_number; 
void  (*func) (sig_number) ; 
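
The table-driven arrangement can be pictured with a short model. The C sketch
below is not the library implementation: the names register_handler(),
dispatch_signal(), handler_table and MAX_SIGNALS are hypothetical, and the
table size is an assumption. It shows only the idea of one common entry point
selecting a handler by signal number.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SIGNALS 32          /* assumed table size for this sketch */

typedef void (*sig_handler_t)(int sig_number);

static sig_handler_t handler_table[MAX_SIGNALS];
static int last_signal_seen = -1;   /* set by example_handler() */

/* Plays the role of signal(): records func for sig_number and
 * returns the previously registered handler (NULL if none). */
sig_handler_t register_handler(int sig_number, sig_handler_t func)
{
    assert(sig_number >= 0 && sig_number < MAX_SIGNALS);
    sig_handler_t previous = handler_table[sig_number];
    handler_table[sig_number] = func;
    return previous;
}

/* Plays the role of the single library entry point: looks up the
 * handler for sig_number and calls it. Returns 0 when no handler is
 * registered, where a real system would take a default action. */
int dispatch_signal(int sig_number)
{
    assert(sig_number >= 0 && sig_number < MAX_SIGNALS);
    sig_handler_t func = handler_table[sig_number];
    if (func == NULL)
        return 0;
    func(sig_number);
    return 1;
}

/* Example handler: remembers which signal was delivered. */
void example_handler(int sig_number)
{
    last_signal_seen = sig_number;
}
```

Registering example_handler for signal 8 and then dispatching signal 8 calls the
handler with the value 8; dispatching an unregistered signal number reports
that no handler exists.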


A signal can only be generated for an interrupt if the code vectored to by the
interrupt calls the shared library routine known as the trampoline code. It is known as
the trampoline code because signals bounce from this code to the registered signal
handler. To ensure that the trampoline code is called when an interrupt occurs, the
Freeze mode code vectored to by the interrupt must pass execution to the trampoline
code, indicating the signal which has occurred. The signal_associate macro shown
below can be used to install the Freeze mode code and associate a signal number with
the interrupt or trap hardware number.


        .reg    it0,gr64                ;freeze mode interrupt
        .reg    it1,gr65                ;temporary registers

        .macro  signal_associate, trap_number, sig_number

        sub     gr1,gr1,4*4             ;get lr0-lr3 space
        asgeu   V_SPILL,gr1,rab         ;check for stack spill
        add     lr1,gr121,0             ;save gr121
        add     lr0,gr96,0              ;save gr96
        const   gr121,290               ;HIF 2.0 SETTRAP service
        const   lr2,trap_number         ;trap number, macro parameter
        const   lr3,$1                  ;trap handler address
        consth  lr3,$1
        asneq   69,gr1,gr1              ;HIF service request
        add     gr121,lr1,0             ;restore gr121
        add     gr96,lr0,0              ;restore gr96
        add     gr1,gr1,4*4             ;restore stack
        jmp     $2                      ;macro code finished
        asleu   V_FILL,lr1,rfb          ;check for stack fill

$1:     const   it0,sig_number          ;start of Interrupt handler
        push    msp,it0                 ;push sig_number on
        push    msp,gr1                 ; interrupt context frame.
        push    msp,rab                 ;use push macro,
        const   it0,512                 ; see section 3.3.1
        sub     rab,rfb,it0             ;set rab = rfb-WindowSize

        pushsr  msp,it0,pc0             ;push special registers
        pushsr  msp,it0,pc1
        pushsr  msp,it0,pc2
        pushsr  msp,it0,cha
        pushsr  msp,it0,chd
        pushsr  msp,it0,chc
        pushsr  msp,it0,alu
        pushsr  msp,it0,ops
        push    msp,tav                 ;push tav (gr121)

        mfsr    it0,ops                 ;set DI in CPS, but timer
        or      it0,it0,0x2             ; interrupts are still on
        mtsr    ops,it0                 ;this disables interrupts
        mtsrim  chc,0                   ; in signal trampoline code

        const   it1,RegSigHand          ;the trampoline code is
        consth  it1,RegSigHand          ; described in section 4.4.1
        load    0,0,it1,it1             ;RegSigHand is a library variable
        cpeq    it0,it1,0               ;test for no handler
        jmpt    it0,SigDfl              ;jump if no handler(s)
        add     it0,it1,4               ;it1 has trampoline address
        mtsr    pc1,it1                 ;IRET to signal
        mtsr    pc0,it0                 ; trampoline code
        iret
$2:                                     ;end of macro
        .endm



The above macro code does not clear the interrupt request from the requesting
device. Clearing the request is necessary for external interrupts; reenabling interrupts
without having first removed the current interrupt request causes the interrupt to be
taken again immediately. The code sets the DI-bit in the OPS special register; this
means interrupts will remain disabled in the trampoline code. It will be the
responsibility of the C language handler to clear the interrupt request; this may require
accessing an off-chip peripheral device. An alternative is to clear the interrupt request
in the above Freeze mode code and not set the DI-bit in the OPS. This would enable the
trampoline and C language handler code to execute with interrupts enabled. This
would lead to the possibility of nested signal events; however, the signal trampoline
code is able to deal with such complex events.

With the example signal_associate macro, the trampoline code and the C
handler run in the processor mode in effect at the time the interrupt occurred. They can
be forced to run in Supervisor mode by setting the Supervisor mode bit (SM-bit) when
OR-ing the DI-bit in the OPS register. Supervisor mode may be required to enable
accessing of the interrupting device when disabling the interrupt request. The address
translation bits (PI and PD) may also be set at this time to turn off virtual addressing
during interrupt processing. To make these changes to the above example code, the
value 0x72 should be OR-ed with the OPS register rather than the 0x2 value shown.
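
The two constants quoted above can be derived from individual status bits. The
C sketch below assumes the following bit positions in the OPS register: DI at
bit 1, SM at bit 4, and the instruction and data physical-addressing bits at
bits 5 and 6. These positions are assumptions chosen to be consistent with the
0x2 and 0x72 values in the text, and the names PS_DI, PS_SM, PS_PI, PS_PD and
ops_for_trampoline() are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit assignments; they reproduce the constants in the text. */
enum {
    PS_DI = 1 << 1,   /* disable interrupts                  */
    PS_SM = 1 << 4,   /* supervisor mode                     */
    PS_PI = 1 << 5,   /* physical addressing, instructions   */
    PS_PD = 1 << 6    /* physical addressing, data           */
};

/* Returns the OPS value the Freeze mode code would write back before
 * the IRET which starts the trampoline code. DI is always set. */
uint32_t ops_for_trampoline(uint32_t ops, int supervisor, int physical)
{
    ops |= PS_DI;
    if (supervisor)
        ops |= PS_SM;
    if (physical)
        ops |= PS_PI | PS_PD;
    assert((ops & PS_DI) != 0);
    return ops;
}
```

Starting from a zero OPS value, the default case gives 0x2, and requesting both
Supervisor mode and physical addressing gives 0x72.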

As described in section 2.5, a C language macro can be used to access the
assembly level macro instruction. The definition of the C macro for the High C 29K
compiler is shown below. Users of the GCC compiler should replace the
_ASM() call with the equivalent asm() C language extension.

#define signal_associate(trap_number, sig_number)  \
        /*int  trap_number;                        \
          int  sig_number;                      */ \
        _ASM("  signal_associate "#trap_number","#sig_number);

When the macro is used to associate a signal number with a processor trap
number, it is also necessary to supply the address of the C language signal handler called
when the signal occurs. The following example associates trap number 18
(floating-point exception) with signal number 8. This signal is known to UNIX and HIF users
as SIGFPE; when it occurs, the C handler sigfpe_handler is called.

_ASM("  .include \"interrupt_macros.h\"");
signal_associate(18, 8);            /* trap 18, F-P */
signal(8, sigfpe_handler);          /* signal 8 handler */

C language signal handlers are free of many of the restrictions which apply to
Freeze mode interrupt handlers. However, the HIF specification still restricts their
operation to some extent. Signal handlers can only use HIF services with service
numbers greater than 256. This means that printf() cannot be used. The reason for
this is that HIF services below 256 are not reentrant, and a signal may occur while just
such a HIF service request is being processed. Return from the signal handler must
be via one of the signal return services: sigdfl, sigret, sigrep or sigskp. If the signal
handler simply returns, the trampoline code will issue a sigdfl service request on
behalf of the signal handler.

A single C level signal routine can be used to dispatch several C language
interrupt handlers. Section 4.3.12 describes an interrupt queuing method, where interrupt
handlers run in Freeze mode and build an interrupt descriptor (bead). Each descriptor
is placed in a list (string of beads) and a Dispatcher routine is used to process
descriptors. The signal handling method described above can be used to register a C level
Dispatcher routine. This results in C level context being prepared only once and the
Dispatcher routine calling the appropriate C handler.
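
The queuing idea can be sketched in C. The structure layout and the names used
below (struct bead, enqueue_bead(), dispatch_beads(), count_event()) are
hypothetical and are not taken from section 4.3.12; the sketch only shows a
list of descriptors being built in arrival order and later drained by a single
dispatcher.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*c_handler_t)(void *device_data);

/* Hypothetical interrupt descriptor ("bead"). */
struct bead {
    c_handler_t  handler;       /* C level handler to run later  */
    void        *device_data;   /* data captured at interrupt    */
    struct bead *next;          /* next bead on the string       */
};

struct bead_queue {
    struct bead *head;
    struct bead *tail;
};

/* Stands in for the Freeze mode code: append a bead at the tail. */
void enqueue_bead(struct bead_queue *q, struct bead *b)
{
    assert(q != NULL && b != NULL);
    b->next = NULL;
    if (q->tail != NULL)
        q->tail->next = b;
    else
        q->head = b;
    q->tail = b;
}

/* Dispatcher: runs the handler of each bead in arrival order and
 * returns the number of beads processed. */
int dispatch_beads(struct bead_queue *q)
{
    int processed = 0;

    assert(q != NULL);
    while (q->head != NULL) {
        struct bead *b = q->head;
        q->head = b->next;
        if (q->head == NULL)
            q->tail = NULL;
        b->handler(b->device_data);
        processed++;
    }
    return processed;
}

/* Example C level handler: counts the events it is given. */
void count_event(void *device_data)
{
    ++*(int *)device_data;
}
```

In a real system the list would be shared between Freeze mode code and the
Dispatcher, so updates to it would need protection from interrupts; that
protection is omitted from this model.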

2.5.4  Interrupt  Tag  Words 

Release 3.2, or newer, of the High C 29K compiler supports routines of defined
return-type _Interrupt. The use of this C language extension causes an additional tag
word to be placed ahead of the procedure code. Section 3.6 explains the format of the
interrupt tag in detail. Note, to use the _Interrupt key word with a PC hosted compiler,
it is necessary to add the line “#define _Interrupt _CC(_INTERRUPT)” to file
29k/bin/hc29.pro. The _Interrupt key word, in conjunction with some simple support
routines presented below, makes optimizing of interrupt preparation very easy. By
examining the interrupt tag word it is possible to determine if a handler routine
qualifies for Freeze mode execution or will require HIF signal processing. The
example code shown below is for a HIF conforming operating system. However, a
different operating system may choose to respond to interrupt tag information in a
somewhat different manner. Only the signal, settrap and sigret-type subset of HIF
services are required. A different operating system may have equivalent support
services.

When  an  interrupt  occurs,  it  would  be  possible  to  examine  the  interrupt  tag  word 
of  the  assigned  handler.  However,  this  would  be  an  overhead  encountered  at  each 
interrupt  and  it  would  increase  interrupt  processing  time.  It  is  better  to  examine  the 
tag  word  at  interrupt  installation  time  and  determine  the  necessary  interrupt 
preparation  code.  Preceding  sections  have  described  interrupt  context  caching  and 
signal  processing.  It  would  be  possible  to  examine  the  tag  word  in  more  detail  than 
the  following  example  code  undertakes.  This  would  produce  additional  intermediate 
performance  points  in  the  spectrum  of  interrupt  preparation  code;  context  caching 
being  the  fastest  point  on  the  spectrum  and  signal  processing  the  slowest.  However, 
signal  processing  can  always  be  used  and  is  free  of  the  restrictions  which  apply  to  the 
use  of  interrupt  context  caching,  and  context  caching  is  frequently  adequate.  This 
renders  the  chosen  spectrum  points  as  most  practicable. 

The example below shows two C language interrupt handler routines. The first,
f_handler(), looks like it will qualify for Freeze mode execution. The key word
_Interrupt has been used during the procedure definition and this will result in an
interrupt tag word. The second function, s_handler(), is not a leaf procedure and this
fact will be reported in its interrupt tag word. Being a non-leaf routine, it will be
processed as a signal handler. Such routines receive a single parameter: the signal
number.


extern int  sig_sig;              /* defined in library code */
extern int  sig_intr0;            /* signal for INTR0 */
extern char *UART_p;              /* pointer to UART */
char recv_data[50];

_Interrupt f_handler()            /* Freeze mode handler */
{
    static int count=0;

    recv_data[count]=*UART_p;     /* read from UART */
    if(recv_data[count]=='\n')    /* test for end */
    {   sig_sig=sig_intr0;        /* signal #30 */
        count=0;                  /* reset counter */
    }
    else
        count++;
}

_Interrupt s_handler(sig_number)  /* signal handler */
int sig_number;                   /* for sig_intr0 */
{
    printf("in signal handler number=%d\n", sig_number);
    printf("received string=%s\n", recv_data);
    _sigret();
}


Most  programmers  do  not  want  to  become  concerned  with  the  details  of 
interrupt  preparation.  They  simply  wish  to  call  an  operating  system  service  routine 
which  will  examine  the  interrupt  tag  word  and  select  the  appropriate  interrupt 
preparation  code.  The  library  procedure,  interrupt(),  shown  below,  is  just  such  a 
service routine. The operation of this procedure is described a little later. The
procedure ensures that either interrupt context caching or signal processing will be
applied for the supplied handler and selected 29K trap number. The interrupt()
routine must be executed during the system initialization stage, before traps or
interrupts are expected to occur. An example initialization sequence is shown below:

int sig_intr0;

main()
{
    sig_intr0=interrupt(16,s_handler);   /* INTR0 */
    interrupt(17,f_handler);             /* INTR1 */
}


Interrupt tag words are dealt with at interrupt installation time, and not at
program assembly or link time. There have been discussions about adding a compiler
pragma option to High C 29K release 4.0 which, when switched on, would cause a
macro instruction to be placed in the output assembly code rather than an interrupt tag
word. This requires that the relevant C code be compiled, then assembled with an
include file which defines the replacement code for the interrupt macro instruction.
This technique has some disadvantages, principally that the macro must understand
the capabilities of the operating system and how it intends to deal with interrupts; in
particular, whether the interrupt should be processed in User or Supervisor mode, with
interrupts enabled or disabled, with or without address translation, and so on. Use of a
macro does have the advantage that the interrupt preparation code appears in the final
linked program image. The tag word method relies on the preparation code being
built in heap memory during interrupt installation. The preparation code is built in
consultation with the operating system and is thus more portable between different
operating systems which support somewhat different interrupt processing
environments.

Fortunately  for  the  user,  library  routines  are  responsible  for  installing  the 
executable  code  into  heap  memory.  The  code  itself  is  similar  to  the  code  of  previous 
sections.  A  portion  of  the  code  is  linked  into  text  space  of  the  program  image.  At 
installation  time  the  code  is  copied  into  heap  memory  and  further  optimized.  The 
code  sequence  below  is  used  for  interrupt  context  caching. 


        .text
        .align  4
        .global _interrupt_cache_code
        .global _interrupt_cache_end
        .extern _sig_sig
_interrupt_cache_code:
        add     gr80,gr111,0            ;save gr111 to interrupt
        add     gr79,gr110,0            ; context cache
        add     gr78,gr109,0
        add     gr77,gr108,0
        add     gr76,gr107,0
        add     gr75,gr106,0
        add     gr74,gr105,0
        add     gr73,gr104,0
        add     gr72,gr103,0
        add     gr71,gr102,0
        add     gr70,gr101,0
        add     gr69,gr100,0
        add     gr68,gr99,0
        add     gr67,gr98,0
        add     gr66,gr97,0
        add     gr64,lr0,0              ;save lr0
        const   lr0,0                   ;const and consth
        consth  lr0,0                   ; need to be modified
        calli   lr0,lr0                 ;call C handler
        add     gr65,gr96,0
        add     gr111,gr80,0            ;restore gr111 from
        add     gr110,gr79,0            ; context cache
        add     gr109,gr78,0
        add     gr108,gr77,0
        add     gr107,gr76,0
        add     gr106,gr75,0
        add     gr105,gr74,0
        add     gr104,gr73,0
        add     gr103,gr72,0
        add     gr102,gr71,0
        add     gr101,gr70,0
        add     gr100,gr69,0
        add     gr99,gr68,0
        add     gr98,gr67,0
        add     gr97,gr66,0
        add     gr96,gr65,0
        add     lr0,gr64,0              ;restore lr0
        const   gr64,_sig_sig           ;the following eight
        consth  gr64,_sig_sig           ; instructions deal with
        load    0,0,gr64,gr64           ; sig_sig testing
        cpeq    gr65,gr64,0             ;test for zero
        const   gr66,_signal_associate_code + 4
        consth  gr66,_signal_associate_code + 4
        jmpfi   gr65,gr66               ;jump if sig_sig
        nop                             ; addressing
        iret
_interrupt_cache_end:


The context cache code is a little different from the code shown in section 2.5.1.
Eight extra instructions have been added to support a memory variable called sig_sig,
which supports a very useful technique of two-level interrupt processing. In most cases
a Freeze mode interrupt handler is used alone. However, when the sig_sig variable is
set to a signal number before the Freeze mode handler completes, a signal is
generated, causing a second-level signal handler routine to execute after the Freeze
mode handler returns.

Examine the example handler routines. When interrupt INTR1 (vector 17)
occurs, the Freeze mode handler, f_handler(), normally accesses the interrupting
UART and receives a character; it then increments the count value and returns. The
process of accessing the UART causes the interrupt request to be deasserted. This
results in a very fast interrupt handler written in C. However, when the received
character is a '\n' (newline), sig_sig is set to the signal number allocated to the
INTR0 signal handler. This causes s_handler() to be executed in response to the
signal. The occurrence of interrupt INTR0 (vector 16) also causes s_handler() to
execute as a signal handler associated with trap 16. The example interrupt() service
automatically allocates signal numbers, starting with SIGUSR1, to handler routines
which are to be processed via signal trampoline code. The interrupt() procedure
returns the selected signal number; zero is returned if a Freeze mode handler is
selected. An interrupt handler can thus be restricted to fast Freeze mode processing,
and when more extensive processing is required the sig_sig variable can be set and a
second-level handler invoked. (Note that the s_handler() routine calls the printf()
library routine. This is not permitted with the High C 29K library routines as the
printf() routine is not reentrant. However, the use of printf() helps illustrate the
two-stage principle.)
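
The two-level flow just described can be tried out on a host machine. The sketch
below is only a model, not the library code: the names f_handler_model(),
s_handler_model(), interrupt_model() and feed() are hypothetical, characters are
passed as arguments instead of being read from a UART, and the test-and-clear of
sig_sig is written in C rather than in the eight added instructions. A bounds
guard and a string terminator are also added, which the original handler does
not have.

```c
#include <assert.h>
#include <string.h>

/* Host-side model only: on a 29K these roles are played by the context
   cache code, the sig_sig library variable and the HIF signal services. */
int  sig_sig;               /* 0 = no second-stage request            */
char recv_data[50];
int  lines_seen;            /* counts second-stage invocations        */

/* First stage: store one character, request stage two at end of line. */
void f_handler_model(char c, int sig_intr0)
{
    static int count = 0;

    recv_data[count] = c;
    if (c == '\n') {
        recv_data[count] = '\0';   /* terminate the string (model only) */
        sig_sig = sig_intr0;       /* request the second stage          */
        count = 0;
    } else if (count < (int)sizeof recv_data - 2) {
        count++;                   /* guard added for the model         */
    }
}

/* Second stage: stands in for s_handler(). */
void s_handler_model(int sig_number)
{
    (void)sig_number;
    lines_seen++;
}

/* Models the check made after the first stage returns: test sig_sig,
   clear it, and start the signal handler only if one was requested.   */
void interrupt_model(char c, int sig_intr0)
{
    f_handler_model(c, sig_intr0);
    if (sig_sig != 0) {
        int sig = sig_sig;
        sig_sig = 0;
        s_handler_model(sig);
    }
}

/* Feed a string through the model; returns the lines completed so far. */
int feed(const char *s, int sig_intr0)
{
    assert(sig_intr0 != 0);
    while (*s)
        interrupt_model(*s++, sig_intr0);
    return lines_seen;
}
```

Calling feed("hello\n", 30) runs the first stage six times and the second stage
once, which is the behavior the text attributes to f_handler() and s_handler().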

To perform signal processing, the trampoline code shown below is placed in
heap memory. It is similar to the code of section 2.5.3. Interrupts are disabled before
signal processing commences; this is not necessary if a Freeze mode handler has
already requested the interrupting device to deassert the interrupt request. If a Freeze
mode handler is always executed before the associated signal handler, the three
indicated lines of code can be removed. Doing so enables nested interrupts to be
supported without explicitly reenabling interrupts in the signal handler. However, if
the signal preparation code is called directly from the interrupt vector table (via an
interrupting device) then interrupts must be initially disabled by the shared signal
preparation code.


        .global _signal_associate_code
        .global _signal_associate_end
        .reg    it0,gr64
        .reg    it1,gr65
_signal_associate_code:
        const   gr64,0                  ;signal number in it0
        const   it1,0                   ;clear sig_sig variable
        const   it2,_sig_sig            ; need not do this if signal
        consth  it2,_sig_sig            ; handler is called directly
        store   0,0,it1,it2             ; from vector table entry
        push    msp,it0                 ;push signal number on stack
        push    msp,gr1                 ;interrupt context stack
        push    msp,rab                 ;use 'push' macro
        const   it0,512                 ; see section 3.3.1
        sub     rab,rfb,it0             ;set rab=rfb-WindowSize
        pushsr  msp,it0,pc0             ;push special registers
        pushsr  msp,it0,pc1
        pushsr  msp,it0,pc2
        pushsr  msp,it0,cha
        pushsr  msp,it0,chd
        pushsr  msp,it0,chc
        pushsr  msp,it0,alu
        pushsr  msp,it0,ops
        push    msp,tav                 ;push tav (gr121)
        mfsr    it0,ops                 ;set DI in CPS, but timer
        or      it0,it0,0x2             ; interrupts are still on
        mtsr    ops,it0                 ;this disables interrupts
        mtsrim  chc,0                   ; in signal trampoline code
        const   it1,RegSigHand          ; the trampoline code is
        consth  it1,RegSigHand          ; described in section 4.4.1
        load    0,0,it1,it1             ;RegSigHand is a library
        add     it0,it1,4               ; variable
        mtsr    pc1,it1                 ;IRET to signal
        mtsr    pc0,it0                 ; trampoline code
        iret
_signal_associate_end:


All of the code presented is available from AMD in source and linkable library
form. The interrupt() install routine itself is listed below and is
surprisingly short. Its operation is simple: it examines the interrupt tag word of the
supplied C handler. Note that it assumes that the interrupt procedure has a one-word
procedure tag preceded by an interrupt tag word; this is almost always the case. If
no interrupt tag is found then signal handling is selected. This would be the case if the
handler routine had been built with the GNU compiler, which does not currently
support interrupt tag words.

Depending  on  the  tag  word,  Freeze  mode  or  signal  processing  is  selected  and  the 
appropriate  code  copied  into  heap  memory  space.  For  Freeze  mode  processing,  only 
the  required  number  of  global  registers  is  saved  in  the  interrupt  context  cache 
(gr64-gr80). Additionally, only the minimum required amount of heap memory is
requested  via  the  HIF-library  malloc()  service.  After  copying  code  into  heap 
memory,  some  instruction  patching  is  performed  to  correctly  reference  the  assigned 
C  handler.  Finally  the  HIF-library  _settrap()  service  is  used  to  assign  a  trap  handler 
address  to  the  requested  trap  number.  Note  that  when  the  copying  is  performed,  the 
heap  memory  is  only  written  to  and  never  read.  This  will  prevent  the  code  being 
placed  into  on-chip  data  cache,  as  29K  family  data  caches  only  allocate  cache  blocks 
on  data  reads.  Avoiding  caching  of  the  relevant  heap  memory  ensures  that  the  new 
code  will  be  fetched  from  instruction  memory  (see  sections  5.13.2  and  5.14.4). 

int interrupt(trap_number, C_handler)
int trap_number;
void (*C_handler)();
{
    int *tag_p=(int*)C_handler - 2;
    int ret_sig;                            /* return signal value */
    int tag_word = *tag_p;
    int glob_regs, *trap_handler, i, size;
    _LOCK volatile int *code_p, *mem_p;     /* see section 5.14.1 */

    if((tag_word & 0xff000000) != 0)
        tag_word = -1;                      /* no interrupt tag word */
    if((tag_word & 0xffff00ff)==0)
    {   glob_regs=(tag_word & 0xff00) >> 8;
        code_p=&interrupt_cache_code;
        size=4*((2*glob_regs)+6+8);         /* +8 for sig_sig code support */
        mem_p=(int*)malloc(size);           /* get heap memory */
        trap_handler=mem_p;
        code_p=code_p+(16-glob_regs);       /* find start of save */
        for(i=1; i<=glob_regs; i++)         /* copy save code */
            *mem_p++=*code_p++;
        /* supply address to CONST instruction */
        *mem_p++ =*code_p++ | ((((int)C_handler&0xff00)<<8)
                              + ((int)C_handler&0xff));
        /* supply address to CONSTH inst. */
        *mem_p++ =*code_p++ | ((((int)C_handler&0xff000000)>>8)
                              + (((int)C_handler&0xff0000)>>16));
        for(i=1; i<=(4-2); i++)             /* copy the call code */
            *mem_p++=*code_p++;
        code_p=code_p + (16-glob_regs);     /* find start of restore */
        for(i=1; i<=(glob_regs+2+8); i++)   /* copy restore code; the +8 */
            *mem_p++=*code_p++;             /* is for sig_sig code support */
        ret_sig=0;
    }
    else
    {   static int sig_number=30;           /* SIGUSR1 in SigEntry */

        ret_sig=sig_number;
        signal(sig_number,C_handler);
        size=4*(signal_associate_end-signal_associate_code);
        mem_p=(int*)malloc(size);           /* get heap memory */
        trap_handler=mem_p;
        code_p=signal_associate_code;
        /* supply sig_number to CONST instruction */
        *mem_p++ = *code_p++ | (((sig_number&0xff00)<<8)
                               + (sig_number&0xff));
        for(i=1; i<=(size/4-1); i++)        /* copy rest of code; size is */
            *mem_p++ = *code_p++;           /* in bytes, the copy in words */
        sig_number++;
    }
    _settrap(trap_number, (void(*)())trap_handler);  /* HIF service */
    return ret_sig;
}
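
The two decisions made by interrupt() can be isolated and exercised on a host:
classifying the tag word, and spreading an address across the immediate fields
of the CONST and CONSTH instructions. The following is a minimal sketch under
stated assumptions. The names plan_from_tag(), patch_const() and patch_consth()
are hypothetical, the masks are taken directly from the listing, and the
instruction words used with it are illustrative templates rather than values
confirmed by the text. The meaning of the individual tag word fields is not
modeled beyond the tests the listing performs.

```c
#include <assert.h>
#include <stdint.h>

enum prep_kind { PREP_SIGNAL = 0, PREP_FREEZE = 1 };

struct prep_plan {
    enum prep_kind kind;
    int glob_regs;    /* registers to save; meaningful for PREP_FREEZE */
    int code_bytes;   /* heap bytes needed for the Freeze mode code    */
};

/* Mirrors the tag word tests made by interrupt(). */
struct prep_plan plan_from_tag(uint32_t tag_word)
{
    struct prep_plan p = { PREP_SIGNAL, 0, 0 };

    if ((tag_word & 0xff000000u) != 0)   /* no interrupt tag word        */
        return p;
    if ((tag_word & 0xffff00ffu) != 0)   /* does not qualify for Freeze  */
        return p;                        /* mode: use signal processing  */

    p.kind = PREP_FREEZE;
    p.glob_regs = (int)((tag_word & 0xff00u) >> 8);
    assert(p.glob_regs <= 16);           /* template holds 16 save slots */
    /* save + restore copies, 6 fixed instructions, 8 for sig_sig test */
    p.code_bytes = 4 * (2 * p.glob_regs + 6 + 8);
    return p;
}

/* The 16-bit immediate occupies bits 23..16 (high byte) and bits 7..0
   (low byte) of the instruction word, as the listing's shifts imply.  */
#define IMM_FIELD_MASK 0x00ff00ffu

uint32_t imm16_fields(uint32_t imm16)
{
    return ((imm16 & 0xff00u) << 8) | (imm16 & 0x00ffu);
}

/* Patch a CONST template with the low half of a 32-bit address. */
uint32_t patch_const(uint32_t template_insn, uint32_t addr)
{
    assert((template_insn & IMM_FIELD_MASK) == 0);  /* fields start empty */
    return template_insn | imm16_fields(addr & 0xffffu);
}

/* Patch a CONSTH template with the high half of a 32-bit address. */
uint32_t patch_consth(uint32_t template_insn, uint32_t addr)
{
    assert((template_insn & IMM_FIELD_MASK) == 0);
    return template_insn | imm16_fields(addr >> 16);
}
```

For a tag word of 0x00000500 the sketch selects Freeze mode with five registers
and 96 bytes of code, matching the size expression in the listing. ORing into an
empty field is why the template instructions are assembled with a zero constant.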


Users of the above code who do not want to make use of the two-level
interrupt processing supported via the sig_sig variable can remove the extra eight
instructions in the interrupt_cache_code and should also remove the extra code
copying indicated in the listing above. This will slightly improve interrupt
processing times for Freeze mode handlers. Other users who want to further exploit
the two-level approach can assign a single handler for all second-level interrupt
processing; this is discussed in section 4.3.12. Interrupts are first dealt with in Freeze
mode by building an interrupt descriptor bead; then a second-level Dispatcher routine
is responsible for popping beads off a string and calling the assigned second-level
handler. Alternatively, a signal dispatcher technique can be applied; section 2.5.6
describes the method. Signal dispatching can be achieved entirely with support
routines accessible from C level, which makes signal dispatching particularly
attractive.

If the interrupt() routine is used extensively for multiple signal handlers, it will
be necessary to increase the size of the signal handler array (SigEntry, described in
Appendix B). The array is normally large enough to hold signal numbers 1 through
32. Unless signal allocation is started at a number less than SIGUSR1 (30), there is
normally only sufficient space for two signal handlers.


2.5.5  Overloaded  INTR3 

The microcontroller members of the 29K family contain several on-chip
peripherals. These peripherals can generate interrupts which are all directed to the core
29K processor via interrupt line INTR3. This causes overloading of the INTR3
vector handler. When a microcontroller receives an INTR3 interrupt, it must examine its
Interrupt Control Register (ICR) to determine the source of the interrupt. This
requires all interrupts to initially be processed via the INTR3 vector handler. The
INTR3 handler must call the appropriate device service routine. The service routine
first clears the interrupt request by writing a one to the correct bit in the ICR; it can
then reenable interrupts and service the current request. The general format of the
ICR is shown in Figure 2-3.

Figure 2-3. 29K Microcontroller Interrupt Control Register


The overloading of INTR3 adds complexity to the task of building a Freeze
mode interrupt handler for each interrupting device. The problem can be resolved by
allocating a region of the vector table for use by the interrupting devices sharing
INTR3. The code below (intr3) reserves 33 vector table entries starting with vector
220; these vectors are not normally used by a 29K based system. When an INTR3
occurs, the code examines the ICR register with a Count Leading Zeros (CLZ)
instruction. This assigns the highest priority to the leftmost bit (interrupt) set
in the ICR register. The value produced by the CLZ instruction is added to the base
value of 220 and the result used to obtain the correct vector entry from the vector
table.


        .reg    it0,gr64
        .reg    it1,gr65
        .global _intr3
_intr3:
        const   it0,0x80000028          ;Interrupt Control register
        consth  it0,0x80000028
        load    0,0,it1,it0
        clz     it1,it1                 ;priority order index
        const   it0,220                 ;base vector number
        add     it1,it0,it1             ;add offset to base
        sll     it1,it1,2               ;convert to word offset
        mfsr    it0,vab                 ;get vector table base
        add     it1,it0,it1             ;get handler address
        load    0,0,it1,it1             ; from vector table
        jmpi    it1                     ;jump to interrupt
        nop                             ; handler

The intr3 code completes by jumping to the selected vector handler. Note that the
code makes use of the four interrupt temporary registers (it0-it3, gr64-gr67)
normally reserved by an operating system for interrupt handling. Each peripheral device
which can set a bit in the ICR register is assigned a unique vector handler number in
the range 220-252. If no bit is found to be set in the ICR register, vector 252 is
selected.
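
The vector selection performed by the intr3 code can be checked with a small
host-side sketch. It assumes nothing beyond what the text states: the base
vector is 220, the offset is the count of leading zeros in the ICR value, and an
empty ICR selects vector 252. The function names clz32(), intr3_vector() and
vector_byte_offset() are hypothetical, and the portable loop stands in for the
CLZ instruction.

```c
#include <assert.h>
#include <stdint.h>

#define INTR3_BASE_VECTOR 220

/* Portable count-leading-zeros; returns 32 when no bit is set. */
int clz32(uint32_t x)
{
    int n = 0;

    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Vector number chosen for a given ICR value: the leftmost set bit
   wins, so it has the highest priority.                            */
int intr3_vector(uint32_t icr)
{
    int vector = INTR3_BASE_VECTOR + clz32(icr);

    assert(vector >= 220 && vector <= 252);
    return vector;
}

/* Byte offset of a vector table entry from the table base (the shift
   by two in the intr3 code).                                        */
uint32_t vector_byte_offset(int vector)
{
    return (uint32_t)vector << 2;
}
```

When two bits are pending at once, as in an ICR value of 0x80000080, the sketch
picks vector 220; the lower priority request is serviced on a later INTR3 once
the first handler has cleared its own ICR bit.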

Using the intr3 code, it is possible to use the previously described interrupt()
library routine to deal with interrupts. A call to the HIF library procedure
_settrap() is required to install the intr3 code for INTR3 handling. After this is done,
the interrupt() routine can be used to assign interrupt handlers for the selected vector
numbers in the 220-252 range, as shown below.


main()
{
    _settrap(19,intr3);                 /* INTR3 handler */
    interrupt(224,VD_handler);          /* VDI */
    interrupt(237,DMA_handler);         /* DMA0I */
}


The intr3 code does not clear the interrupt request in the ICR register; this is left
to the specific interrupt handler. However, this is insufficient for level-sensitive I/O
port interrupts. In this case the interrupting condition must first be cleared for the
corresponding PIO signal before the ICR bit is cleared. Consequently, the clearing of the
bit in the ICR register is redundant.

AMD evaluation boards are normally supplied with a combined OS-boot
operating system and MiniMON29K DebugCore in the ROM memory. When the target
processor is a microcontroller, the message system used to support OS-boot and
DebugCore communication with MonTIP typically uses an on-chip UART. All on-chip
peripheral generated interrupts are handled via INTR3. MiniMON29K bundle 3.0,
and earlier versions, are built using OS-boot version 2.0. This version of OS-boot
assigned the INTR3 handler for MiniMON29K's sole use. This makes it very
difficult to add additional interrupt handlers for on-chip peripherals. The problem can be
solved by applying the code shown below.

main()
{
    void (*V_minimon)();

    V_minimon=(void(*)())_settrap(19,intr3);   /* INTR3 */
    _settrap(220+24,V_minimon);                /* RXSI interrupt */
    _settrap(220+25,V_minimon);                /* RXDI interrupt */
    _settrap(220+26,V_minimon);                /* TXDI interrupt */
}


The _settrap() HIF library service is used to install a new INTR3 handler; the
address of the old handler is returned. The MiniMON29K code is used to process
three peripheral interrupts via INTR3. The _settrap() service is used again to
separately reinstall the handlers required by MiniMON29K. New interrupt handlers for
additional on-chip peripherals can then be installed with further calls to _settrap() or
interrupt().
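
The chaining step relies only on _settrap() returning the previous handler. The
sketch below models that behavior on a host so the resulting table can be
inspected. It is not the HIF service itself: settrap_model(), handler_for(),
chain_intr3() and the two empty handler functions are hypothetical stand-ins,
and the table is an ordinary array rather than the processor's vector area.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*trap_fn)(void);

#define NUM_VECTORS 256
static trap_fn vector_table[NUM_VECTORS];   /* model of the vector table */

/* Model of _settrap(): install a handler, hand back the previous one. */
trap_fn settrap_model(int trap, trap_fn handler)
{
    trap_fn old;

    assert(trap >= 0 && trap < NUM_VECTORS);
    old = vector_table[trap];
    vector_table[trap] = handler;
    return old;
}

/* Read back the handler currently installed for a trap number. */
trap_fn handler_for(int trap)
{
    assert(trap >= 0 && trap < NUM_VECTORS);
    return vector_table[trap];
}

void minimon_intr3(void) { }    /* stands in for the monitor's handler */
void intr3_dispatch(void) { }   /* stands in for the intr3 code        */

/* Take over INTR3, then re-home the old handler on the three UART
   sub-vectors so the monitor keeps receiving its interrupts.        */
void chain_intr3(void)
{
    trap_fn old = settrap_model(19, intr3_dispatch);
    int offset;

    for (offset = 24; offset <= 26; offset++)   /* RXSI, RXDI, TXDI */
        settrap_model(220 + offset, old);
}
```

After chain_intr3() runs, vector 19 holds the dispatch code and vectors 244 to
246 hold whatever was previously installed on vector 19, leaving the remaining
entries in the 220-252 range free for new handlers.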

2.5.6  A  Signal  Dispatcher 

Release 3.2, or newer, of the High C 29K compiler supports routines of defined
return-type _Interrupt. The use of this non-standard keyword was explained in
section 2.5.4. The keyword is used here to support a signal dispatcher. The method relies
on interrupts being processed in two stages. The first stage operates in Freeze mode.
It responds immediately to the interrupting device, captures any critical data and
deactivates the interrupt request. The second stage, if required, takes the form of a
signal handler. The sig_sig variable is used by the Freeze mode handler to request
signal handler execution. A signal handler cannot be executed without a Freeze mode
handler making the necessary request. This is because interrupts are not disabled in
the signal associate code.

The technique has a number of benefits. It is seldom necessary to disable
interrupts for long periods, as asynchronous interrupt events are only initially dealt with in
Freeze mode. This reduces interrupt latency. Signal handlers can be queued for
processing when nested interrupts would otherwise occur. This eliminates the need to
prepare a C level interrupt processing environment for each interrupt. A C level
environment need only be built for a Signal Dispatcher routine. The Signal Dispatcher is
then responsible for calling the appropriate signal handler for all signals generated by
interrupts. The Signal Dispatcher is started in response to the first signal occurring. The
dispatcher causes execution of the first signal handler, then determines if other signal
handlers have been requested while the current signal handler was executing. The
dispatcher continues to process signals until there are none remaining. At this point
the original interrupted state is restored, that is, the processor state at
the time the first interrupt in the sequence occurred. The first interrupt occurred while
no interrupt or signal handler was being processed, and it caused the Signal
Dispatcher to start execution.

Avoiding nested interrupts, other than for Freeze mode handling, is most
beneficial when large numbers of multiply nested interrupts are expected, and the
cost of preparing C level context for interrupt processing is high. For example, using
interrupt context caching, the processor can be prepared for Freeze mode interrupt
processing in 1-2 microseconds (at 16 MHz). However, with an Am29205
microcontroller, which has a 16-bit off-chip bus and relatively slow DRAM memory,
as much as 40 microseconds can be required to prepare the processor for a C level
signal handler. In this case it is best to prepare for C level interrupt handling only
once. Nested interrupts are avoided by adding new interrupts to a stack when further
interrupts occur while the Signal Dispatcher is executing.

As explained in section 2.5.4, a signal handler is requested when the sig_sig
variable is set by a Freeze mode handler. Previous example code showed how the
signal handler could be started immediately after the Freeze mode handler completes.
The alternative code, shown here, causes the signal to be added to a stack of signals
waiting for processing. Both methods can coexist: setting the sig_sig variable to a
signal number ORed with 0x80000000 indicates the signal should be queued (if
necessary) rather than processed immediately.
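
The two kinds of request differ only in the most significant bit of sig_sig. A
minimal sketch of that encoding follows; the helper names and the SIG_QUEUE_FLAG
macro are hypothetical and are not part of the library described in the text.

```c
#include <assert.h>
#include <stdint.h>

#define SIG_QUEUE_FLAG 0x80000000u   /* most significant bit of sig_sig */

/* Value to store in sig_sig for immediate signal processing. */
uint32_t sig_request_immediate(int sig)
{
    assert(sig > 0);
    return (uint32_t)sig;
}

/* Value to store in sig_sig to have the signal queued instead. */
uint32_t sig_request_queued(int sig)
{
    assert(sig > 0);
    return SIG_QUEUE_FLAG | (uint32_t)sig;
}

/* Non-zero when the stored value asks for queuing. */
int sig_is_queued(uint32_t value)
{
    return (value & SIG_QUEUE_FLAG) != 0;
}

/* Signal number carried by either kind of request. */
int sig_number_of(uint32_t value)
{
    return (int)(value & ~SIG_QUEUE_FLAG);
}
```

Because a plain signal number never has its top bit set, code that tests the
flag can tell the two requests apart without any additional variable.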

First, examine the two interrupt handlers shown below. The Freeze mode
handlers, uart_handler() and timer_handler(), use the _Interrupt keyword. They both
qualify for Freeze mode execution. The UART handler is similar to the example of
section 2.5.4. However, this time, when the end of a string is encountered, sig_sig is
set to the signal number held in uart_sig with the most significant bit also set. This
requests that the associated signal handler be placed in the signal queue.

_Interrupt uart_handler()         /* Freeze mode interrupt handler */
{
    static int count=0;

    recv[count]=*uart_p;          /* access UART */
    if(recv[count]=='\n')         /* end of string ? */
    {   count=0;
        sig_sig=0x80000000 | uart_sig;
    }
    else count++;
}

The Freeze mode timer handler reloads the on-chip timer control registers for
repeated timer operation. Each timer interrupt causes the tick variable to be
incremented, and when a tick value of 100 is reached, signal timer_sig is added to the signal
queue. The Freeze mode handler is written in C. However, it needs to access special
register 9 (TMR, the Timer Reload register), which is not normally accessible from C.
The problem is overcome by using the C language extensions _mfsr() and _mtsr().
They enable special registers to be read and written.

_Interrupt timer_handler()        /* Freeze mode interrupt handler */
{
    static int tick=0;
    int tmr;

    tmr=_mfsr(9);                 /* read TMR special register */
    tmr=tmr&(-1-0x02000000);      /* clear IN bit-field */
    _mtsr(9,tmr);                 /* write to TMR register */
    if(tick++ > 100)
    {   tick=0;
        sig_sig=0x80000000 | timer_sig;
    }
}
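
Two details of the timer handler are easy to verify on a host. The mask written
as -1-0x02000000 is the same 32-bit pattern as the complement of the IN bit, and
a tick counter can be arranged to fire on exactly every Nth interrupt. The
sketch below illustrates both under stated assumptions: the names TMR_IN_BIT,
tmr_clear_in(), tmr_mask_as_written() and tick_divider() are hypothetical, and
tick_divider() is an alternative formulation, not the handler's own test. As
listed, the comparison tick++ > 100 first succeeds on the 102nd interrupt.

```c
#include <assert.h>
#include <stdint.h>

#define TMR_IN_BIT 0x02000000u   /* IN bit-field mask used by the handler */

/* Clear the IN bit and leave every other TMR bit untouched. */
uint32_t tmr_clear_in(uint32_t tmr)
{
    return tmr & ~TMR_IN_BIT;
}

/* The handler's spelling of the same mask, reduced to 32 bits. */
uint32_t tmr_mask_as_written(void)
{
    return (uint32_t)(-1 - 0x02000000);
}

/* Returns 1 on every period-th call and 0 otherwise; the caller owns
   the counter, which plays the role of the static tick variable.    */
int tick_divider(int *tick, int period)
{
    assert(period > 0);
    if (++*tick >= period) {
        *tick = 0;
        return 1;
    }
    return 0;
}
```

With a period of 100, 250 calls to tick_divider() produce two firings, on the
100th and 200th calls.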


The second stage of the UART interrupt handler, the signal handler, is shown
below. Note that the sig_uart() routine calls the printf() library routine. This is not
permitted with the High C 29K library routines as the printf() routine is not reentrant.
However, the use of printf() helps illustrate the operating principle. Normally a
signal handler must use the _sigret() signal return service, at least with a HIF-conforming
operating system. However, when a signal handler is called from the dispatcher,
the signal return service should not be used. It is possible to determine if the
dispatcher is in use by testing the variable dispatcher_running; it becomes non-zero when
the dispatcher is in use. However, testing the dispatcher_running flag may be
insufficient in some circumstances. It is possible that the Signal Dispatcher is running and
initiating signal handler execution while, at the same time, a signal handler is
requested directly by, say, an interrupt. The Dispatcher is running, but the directly
requested signal handler must use the signal return service.

Signals need not always be queued for processing. If a very high priority
(immediate) interrupt occurs and further signal processing is necessary, sig_sig should
simply be set to the signal number. In this case it is important that the signal handler
use the _sigret() service.

_Interrupt sig_uart(sig_number)   /* signal handler for UART */
int sig_number;
{
    printf("in signal handler number=%d\n", sig_number);
    printf("received string=%s\n", recv_data);
    if(!dispatcher_running) _sigret();   /* no _sigret() service call */
}


The Signal Dispatcher is implemented as a signal handler. The dispatcher
removes signals from a stack and calls the appropriate signal handler. When a signal
handler is requested by a Freeze mode handler, and the Signal Dispatcher is not
currently executing, the requested signal (sig_sig value) is not immediately started. In its
place the dispatcher signal handler is initiated.

Figure 2-4 shows an example of the Signal Dispatcher in operation. The
first interrupt is from the UART. It is dealt with entirely in Freeze mode; the sig_sig
variable is not set, so no second-stage signal handler is requested. The UART also
generates the second interrupt. This time the sig_sig variable is set to request that the
sig_uart() signal handler be started by the Signal Dispatcher. While the second-stage
handler is running, a timer interrupt occurs. The Freeze mode timer handler requests that
a second-stage handler be started by the Signal Dispatcher. When the dispatcher completes
the currently executing second-stage handler (the UART's), it initiates the timer's
second-stage handler. When there are no remaining second-stage handler requests, the
dispatcher issues a signal-return service request. The original program's context is then
restored and its execution restarted.

[Figure: timeline of the main program, the Freeze mode handlers uart_handler() and
timer_handler(), signal_associate_code pushing uart_sig and timer_sig on the stack,
and sig_dispatcher() popping each signal and calling sig_uart() and sig_timer()
before calling _sigret().]

Figure 2-4. Processing Interrupts with a Signal Dispatcher
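
The sequence in Figure 2-4 can be replayed with a small host-side model of the
dispatcher loop. This is a sketch under stated assumptions, not the library
routine: queue_signal(), install(), sig_dispatcher_model() and the two model
handlers are hypothetical names, the signal numbers 30 and 31 are illustrative,
and a timer request arriving while the UART handler runs is imitated by having
that handler queue it directly.

```c
#include <assert.h>

#define MAX_PENDING     8
#define MODEL_TIMER_SIG 30       /* illustrative signal numbers only */
#define MODEL_UART_SIG  31

typedef void (*sig_fn)(int);

static int    pending[MAX_PENDING];  /* stack of queued signal numbers */
static int    depth;
static sig_fn handlers[33];          /* indexed by signal number 1..32 */
int dispatcher_running;
int trace[16], trace_len;            /* order in which handlers ran    */

void install(int sig, sig_fn fn)
{
    assert(sig >= 1 && sig <= 32);
    handlers[sig] = fn;
}

/* What a queued Freeze mode request amounts to: remember the signal. */
int queue_signal(int sig)
{
    if (depth == MAX_PENDING)
        return -1;                   /* stack full, request refused    */
    pending[depth++] = sig;
    return 0;
}

/* Dispatcher loop: run second-stage handlers until none remain. */
void sig_dispatcher_model(void)
{
    dispatcher_running = 1;
    while (depth > 0) {
        int sig = pending[--depth];
        if (handlers[sig])
            handlers[sig](sig);
    }
    dispatcher_running = 0;
    /* a real dispatcher would now issue the signal-return request */
}

void sig_timer_model(int sig)
{
    if (trace_len < 16)
        trace[trace_len++] = sig;
}

void sig_uart_model(int sig)
{
    if (trace_len < 16)
        trace[trace_len++] = sig;
    queue_signal(MODEL_TIMER_SIG);   /* timer request arrives meanwhile */
}
```

Installing the two handlers, queuing the UART signal and calling
sig_dispatcher_model() runs the UART handler first and the timer handler second,
with a single entry to and exit from the dispatcher, which is the saving the
text describes.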

Integer variable dispatcher_sig holds the signal number used by the Signal
Dispatcher. The user must select a signal number. The example code below uses 7
(SIGEMT). The signal() library routine is used to assign procedure sig_dispatcher()
to signal number 7. Before signal and trap handlers can be installed, the procedures
and variables defined in the support libraries must be declared external, as shown
below.


    extern void signal(int, void (*handler)(int));
    extern int interrupt(int, _Interrupt (*C_handler)(int));
    extern void sig_dispatcher(int);
    extern int sig_sig;
    extern int dispatcher_sig;          /* dispatcher signal number */
    int uart_sig, timer_sig;

During program initialization, after main() is called, the handler routines and
other support services must be installed. The code below uses the interrupt() library
routine to install a signal handler (sig_timer(), not shown) for timer interrupt support.
The call to interrupt() returns the allocated signal number, and this number is saved
in timer_sig. The timer Freeze mode handler uses the timer_sig value to request that the
timer signal handler be executed. The interrupt() service is called a second time to
install the Freeze mode handler, timer_handler(). The second call causes vector
table entry 14 to be reassigned the address of the Freeze mode handler.

The UART handlers are installed using an alternative method. The signal() service,
rather than the interrupt() service, is used to assign the sig_uart() signal handler
to signal number SIGUSR2. This method allows a specific signal number to be
selected, rather than using the interrupt() service to allocate the next available signal
number. Most users will prefer the previous method, which selects signal numbers
automatically.

    main()
    {
        _settrap(218, _disable);
        _settrap(217, _enable);
        _settrap(216, _timer_init);
        dispatcher_sig = 7;             /* select signal number for dispatcher */
        signal(dispatcher_sig, sig_dispatcher);
        timer_sig = interrupt(14, sig_timer);   /* install signal handler */
        if (interrupt(14, timer_handler))       /* install Freeze handler */
            printf("ERROR: Freeze mode handler not built for trap 14\n");
        if (interrupt(15, uart_handler))        /* install Freeze handler */
            printf("ERROR: Freeze mode handler not built for trap 15\n");
        uart_sig = SIGUSR2;             /* choose a signal number */
        signal(uart_sig, sig_uart);     /* install signal handler */
        timer_init();                   /* initialize the timer */
        . . .
    }


The sig_dispatcher() requires two helper services, disable() and enable().
They are described in more detail shortly, but are simply used to enable and disable
processor interrupts. The _settrap() service is used above to install trap handlers for
these services. The timer_init() routine is not required by the Signal Dispatcher. It is
included simply to make the example more complete.

The interrupt() routine uses the signal_associate method of assigning a trap
number to a signal handler. The code was described in section 2.5.4, but a few small
additions are required to support the Signal Dispatcher. The modified code is shown
below. There are two changes. First, interrupts are not disabled (requiring that a Freeze
mode handler always be used for interrupt deactivation). Second, a call to queue_sig is
made if the most significant bit of the signal number is set.


            .reg    it0,gr64
            .reg    it1,gr65

    _signal_associate_code:
            const   it0,0                   ;signal number in it0
            const   it1,0                   ;clear sig_sig variable
            const   it2,_sig_sig            ; need not do this if signal
            consth  it2,_sig_sig            ; handler is called directly
            store   0,0,it1,it2             ; from vector table entry
            const   it1,_queue_sig
            consth  it1,_queue_sig
            jmpti   it0,it1                 ;jump if msb-bit set
             nop
            push    msp,it0                 ;push signal number on stack
            push    msp,gr1                 ;interrupt context stack
            push    msp,rab                 ;use 'push' macro,
            const   it0,512                 ; see section 3.3.1
            sub     rab,rfb,it0             ;set rab=rfb-WindowSize
            pushsr  msp,it0,pc0             ;push special registers
            pushsr  msp,it0,pc1
            pushsr  msp,it0,pc2
            pushsr  msp,it0,cha
            pushsr  msp,it0,chd
            pushsr  msp,it0,chc
            pushsr  msp,it0,alu
            pushsr  msp,it0,ops
            push    msp,tav                 ;push tav (gr121)
            mtsrim  chc,0
            const   it1,RegSigHand          ;the trampoline code is
            consth  it1,RegSigHand          ; described in section 4.4.1
            load    0,0,it1,it1             ;RegSigHand is a library
            add     it0,it1,4               ; variable
            mtsr    pc1,it1                 ;IRET to signal
            mtsr    pc0,it0                 ; trampoline code
            iret
    _signal_associate_end:
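The role of the most significant bit in the value written to sig_sig can be illustrated with a few lines of portable C. This is only a sketch for experimenting on a host machine, not part of the library code: the helper names make_request(), request_is_queued() and request_signal() are hypothetical, and the 8-bit signal-number mask is an assumption.

```c
#include <stdint.h>

#define SIG_QUEUE_FLAG  0x80000000u  /* msb set: route the request via queue_sig */
#define SIG_NUMBER_MASK 0xffu        /* assumption: signal numbers fit in 8 bits */

/* Build the value a first stage (Freeze mode) handler would store in sig_sig
 * when it wants the Signal Dispatcher to run the second stage handler. */
static uint32_t make_request(uint32_t sig)
{
    return SIG_QUEUE_FLAG | (sig & SIG_NUMBER_MASK);
}

/* Non-zero when the request must be queued rather than started directly. */
static int request_is_queued(uint32_t request)
{
    return (request & SIG_QUEUE_FLAG) != 0;
}

/* Recover the plain signal number from a request word. */
static uint32_t request_signal(uint32_t request)
{
    return request & SIG_NUMBER_MASK;
}
```

With these helpers, make_request(7) gives 0x80000007, which request_is_queued() reports as a queued request for signal 7; a sig_sig value of 0 requests nothing.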


The queue_sig routine is shown below. It pushes the signal number on a signal
stack and advances a stack pointer, sig_stack_p. The operation is performed while
still in Freeze mode and is therefore not interruptible. The variable
dispatcher_running is then tested. If it is set to TRUE, an interrupt return (IRET)
instruction is issued. If it is FALSE, the dispatcher_sig number is obtained and the
signal_associate code continues the process of starting a signal handler; but the
signal number now in use will cause the Signal Dispatcher (sig_dispatcher()) to
commence execution.


    _queue_sig:                             ;jump here from signal_associate
            and     it0,it0,0xff            ;clear msb-bit
            const   it3,_sig_stack_p
            consth  it3,_sig_stack_p
            load    0,0,it2,it3             ;get pointer value
            store   0,0,it0,it2             ;store signal number on stack
            add     it2,it2,4               ;advance stack pointer
            store   0,0,it2,it3
            const   it3,_dispatcher_running
            consth  it3,_dispatcher_running
            load    0,0,it2,it3             ;test if signal dispatcher
            cpeq    it2,it2,0               ; already running
            jmpt    it2,_start_dispatcher
             constn it2,-1
            iret                            ;IRET if running
    _start_dispatcher:
            store   0,0,it2,it3             ;set dispatcher_running
            const   it3,_dispatcher_sig
            consth  it3,_dispatcher_sig
            const   it1,_signal_associate_code+5*4
            consth  it1,_signal_associate_code+5*4
            jmpi    it1                     ;start signal handler
             load   0,0,it0,it3             ;signal=dispatcher_sig


Before  the  signal_associate  code  starts  the  dispatcher  signal  handler,  the 
dispatcher_running  variable  is  set  to  TRUE.  Until  this  variable  is  cleared,  further 
signal  requests  (if  the  most  significant  bit  of  the  signal  number  is  set)  will  be  added  to 
the  queue  of  signals  waiting  for  processing.  The  process  of  adding  a  signal  to  the 
queue  is  kept  simple  —  a  stack  is  used.  Reducing  the  amount  of  code  required  results 
in  less  interrupt  latency  as  the  queue_sig  code  runs  in  Freeze  mode. 
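For readers who find the decision easier to follow in C, the logic of queue_sig can be modelled as below. This is a host-side sketch under stated assumptions, not a replacement for the Freeze mode assembly: the function name queue_sig_model() is hypothetical, the variables are local stand-ins for the library variables of the same name, and a return value of 0 stands for the IRET path.

```c
#define SIG_STACK_SIZE 200

static int sig_stack[SIG_STACK_SIZE];      /* stand-in for the signal stack */
static int *sig_stack_p = sig_stack;       /* next free slot */
static int dispatcher_running = 0;         /* FALSE */
static int dispatcher_sig = 7;             /* signal number chosen for the dispatcher */

/* Model of queue_sig: push the requested signal, then return the signal
 * number whose handler should be started now, or 0 when the request was
 * only queued because the dispatcher is already running (the IRET path).
 * Like the assembly routine, no check is made for stack overflow. */
static int queue_sig_model(unsigned int request)
{
    *sig_stack_p++ = (int)(request & 0xffu);   /* clear msb-bit, push, advance */
    if (dispatcher_running)
        return 0;                              /* dispatcher will find it later */
    dispatcher_running = -1;                   /* TRUE */
    return dispatcher_sig;                     /* start sig_dispatcher() */
}
```

The first request made while the dispatcher is idle returns dispatcher_sig; any request that arrives before the flag is cleared is merely added to the stack.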

The signal handler which performs the dispatch operation is written in C. The
code is shown below. It requires some simple assembly-level support routines which
are described later. Having the code in C is a convenience as it simplifies the task of
modifying the code. Modification is necessary if a different execution schedule is
required for signals waiting in the signal stack. The variables used in the Signal
Dispatcher routine are declared below. Note that sig_stack_p and
dispatcher_running are defined volatile. This is because they may also be modified
by a Freeze mode interrupt handler. It is important that the C compiler be informed
about this possibility. Otherwise it may perform optimizations which prevent value
changes from being observed, such as holding a copy of sig_stack_p in a
register, and repeatedly accessing the register.
    extern void (*_SigEntry[])(int);        /* defined in HIF libraries */

    int sig_stack[200];                     /* signal stack */
    volatile int *sig_stack_p = &sig_stack[0];
    volatile int dispatcher_running;        /* dispatcher running flag */
    int sig_sig = 0;
    int dispatcher_sig;                     /* dispatcher signal number */

The  example  sig_dispatcher()  is  relatively  simple  but  effective.  It  first  disables 
interrupts  before  removing  all  current  signals  from  the  stack.  The  signal  values  are 
transferred  to  an  array.  Interrupts  are  then  reenabled.  Performing  this  procedure  with 
interrupts  disabled  prevents  other  signals  being  added  to  the  stack  while  the  transfer 
operation  is  being  performed.  Signals  are  transferred  to  the  array  in  the  reverse  order 
they  were  placed  on  the  stack.  This  ensures  that  signals  are  ultimately  processed  in 
the  order  in  which  they  were  originally  requested. 

No attempt is made to apply a priority order to pending signals. The necessary
code can be applied after the signals have been removed from the stack. Performing
priority ordering at C-level rather than in the queue_sig code has the advantage of
reducing interrupt latency. Due to the fast operation of 29K processors the need to
prioritize signals is not high, as a signal request is not likely to be kept waiting
very long.
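Should priority ordering be wanted, one possible approach is to sort the signals once they have been copied out of the stack. The following is a minimal sketch, assuming a hypothetical table prio[] indexed by signal number (a larger value meaning more urgent); neither the table nor the function order_by_priority() is part of the HIF libraries. A stable insertion sort is used so that signals of equal priority keep their request order.

```c
/* Reorder sigs[0..n) so that signals with a larger prio[] value come first.
 * Signals of equal priority keep their relative order (stable sort).
 * prio[] is a hypothetical table indexed by signal number. */
static void order_by_priority(int sigs[], int n, const int prio[])
{
    int i, j, key;

    for (i = 1; i < n; i++) {
        key = sigs[i];
        j = i - 1;
        while (j >= 0 && prio[sigs[j]] < prio[key]) {
            sigs[j + 1] = sigs[j];      /* shift less urgent signal right */
            j--;
        }
        sigs[j + 1] = key;
    }
}
```

Because the number of pending signals is normally small, the quadratic cost of an insertion sort is unlikely to matter, and the sort runs with interrupts enabled.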


    void sig_dispatcher(sig)                /* Signal Dispatcher */
    int sig;
    {
        int cps;
        int *sig_p;
        static int sig_array[20];           /* array of signals */
                                            /* needing processing */
        cps = disable(0x20002);             /* set DI and TD in CPS */
        for (;;)
        {   sig_p = &sig_array[0];          /* mark array empty */
            while (sig_stack_p != &sig_stack[0])    /* remove signals from */
            {   --sig_stack_p;                      /* stack */
                *sig_p++ = *(int*)sig_stack_p;      /* copy from */
            }                                       /* stack to array */
            enable(cps);                    /* enable interrupts */
            while (sig_p != &sig_array[0])  /* process signals removed */
            {   --sig_p;                    /* from stack */
                (*_SigEntry[(*sig_p)-1])(*sig_p);
            }
            cps = disable(0x20002);         /* disable interrupts */
            if (sig_stack_p == &sig_stack[0])       /* stack empty ? */
                break;
        }
        dispatcher_running = 0;
        enable(cps);                        /* enable interrupts */
        _sigret();                          /* _sigret() HIF service */
    }                                       /* would restore interrupted cps */
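The double reversal performed by the dispatcher (pop the stack into an array, then walk the array backwards) can be checked on a host machine with the sketch below. It is an illustration only: drain_in_request_order() and MAX_SIGS are hypothetical names, and the function records the processing order instead of calling the signal handlers.

```c
#define MAX_SIGS 20

/* Pop signals from a LIFO stack (newest first) into a temporary array,
 * then walk that array backwards so `order` lists the oldest request
 * first. Stops popping after MAX_SIGS entries. Returns the number of
 * signals written to `order`; *stack_pp is moved down as entries are
 * removed. */
static int drain_in_request_order(int *stack_base, int **stack_pp,
                                  int order[MAX_SIGS])
{
    int tmp[MAX_SIGS];
    int *p = tmp;
    int n = 0;

    while (*stack_pp != stack_base && p != tmp + MAX_SIGS)
        *p++ = *--(*stack_pp);          /* newest signal lands in tmp[0] */
    while (p != tmp)
        order[n++] = *--p;              /* oldest signal is emitted first */
    return n;
}
```

If signals 30, 31 and 7 were pushed in that order, the array `order` receives 30, 31, 7, which is the order in which the second stage handlers would run.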


When there are no remaining signals to process, the dispatcher clears the
dispatcher_running flag and requests the _sigret() signal-return service. It is
possible that a new signal arrives just after the flag is cleared but before the
signal-return service is complete; this cannot be avoided. It does not create a
problem (other than a loss of performance) as a new dispatcher signal handler is
simply started.

The disable() and enable() support routines are used by the Signal Dispatcher to
enable and disable interrupts around critical code. Interrupts are disabled by setting
the DI bit in the Current Processor Status (CPS) register. Freeze mode handler
routines can use the _mtsr() C language extensions to modify special registers. However,
these cannot be used by the dispatcher routine as it may be operating in User mode.
Accessing special register space from User mode would create a protection violation.
The problem is overcome by installing assembly level trap handlers which perform
the necessary special register access. The _settrap() HIF service is used to install the
trap handlers. Further assembly routines are required to assert the selected trap
number. The code for disable() is shown below.


            .global _disable
    _disable:
            asneq   218,gr96,gr96
            jmpi    lr0
             nop

            .global __disable
    __disable:
            mfsr    gr96,ops                ;read OPS
            or      gr97,gr96,lr2           ;OR with passed value
            mtsr    ops,gr97
            iret                            ;copy OPS to CPS

A single parameter is passed to disable(). The parameter is ORed with the CPS
value and the CPS register updated. Since this task is performed by a trap handler, the
OPS register is actually modified, and OPS is copied to CPS when an IRET is issued.
There is a further advantage of using a trap handler to perform the task: the operation
cannot be interrupted, so the read/modify/write of the CPS is atomic.

The code for enable() is similar to disable(). In this case the passed parameter is
simply copied to the CPS. The disable() routine returns the CPS value before
modifying it. The value is normally stored and later passed to enable(). In this way only the
DI and TD (timer disable) bits in the CPS are temporarily modified. Note that older
members of the 29K family do not support the TD bit. In such cases, the interrupt
disable code used by the example sig_dispatcher() routine does not prevent interrupts
being generated by the on-chip timer. The problem can be resolved by modifying
the __enable and __disable assembly routines to clear and set the interrupt enable
(IE) bit in the Timer Reload register.


            .global _enable
    _enable:
            asneq   217,gr96,gr96
            jmpi    lr0
             nop

            .global __enable
    __enable:
            mtsr    ops,lr2
            iret
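The save-and-restore contract between disable() and enable() can be tried out on a host machine with the following model. It is a sketch only: cps_model stands in for the CPS register, the names disable_model() and enable_model() are hypothetical, and the split of the value 0x20002 into a DI bit of 0x2 and a TD bit of 0x20000 is an assumption based on the comment in the dispatcher listing.

```c
#define CPS_DI 0x00002u          /* assumed position of the DI bit */
#define CPS_TD 0x20000u          /* assumed position of the TD bit */

static unsigned int cps_model;   /* stands in for the CPS register */

/* Like disable(): OR the mask into the status word and hand back the
 * value it held before the change. */
static unsigned int disable_model(unsigned int mask)
{
    unsigned int old = cps_model;

    cps_model = old | mask;
    return old;
}

/* Like enable(): copy the saved value back, restoring whatever state
 * was in force when the matching disable_model() call was made. */
static void enable_model(unsigned int saved)
{
    cps_model = saved;
}
```

Because enable_model() restores a saved value rather than clearing bits, nested disable/enable pairs behave sensibly: the inner pair leaves interrupts disabled and only the outer enable_model() call reenables them.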


2.5.7  Minimizing  Interrupt  Latency 

Interrupt latency is minimized if interrupts are never disabled. In practice this
can be difficult to achieve. There are often critical code sections which must run to
completion without interruption. Traditionally, interrupts are disabled before
entering such code sections and reenabled upon critical section completion. However, if
interrupts are processed using the two-stage method described in section 2.5.6 (A
Signal Dispatcher), interrupt disabling can be eliminated.

In place of disabling interrupts around a critical code section, the Signal
Dispatcher is effectively disabled. This allows a first stage interrupt handler to interrupt a
critical code section. Second stage interrupt handlers (signal handlers) are not
initiated during the critical code section, as the Dispatcher is disabled. It is easy to disable
the Dispatcher by simply indicating that it is already active; this will prevent its
activation, which can otherwise occur when the first stage handler is completed (if the
sig_sig variable is set). First stage handlers execute in Freeze mode and can be configured to
avoid access to the shared resource being accessed by critical code sections. The
example below shows how the Signal Dispatcher can be deactivated around a critical
code section.

    #define TRUE  -1
    #define FALSE 0

        . . . interruptible code
        dispatcher_running = TRUE;          /* disable Dispatcher */
        . . . start of critical code section
                                            /* code only interruptible by
                                               Freeze mode handler */
        . . . end of critical code section
        dispatcher_running = FALSE;         /* enable Dispatcher */
        if (sig_stack_p != &sig_stack[0])
            _sendsig(dispatcher_sig);


When the critical task has been accomplished, the Dispatcher is reenabled by
clearing the dispatcher_running variable. It is possible that one or more signal
numbers were pushed on the signal stack during the critical stage. Hence, when the
Dispatcher is reenabled, the signal stack must be tested to determine if there are any
pending signals. If there are, then the Signal Dispatcher must be started using the
_sendsig() HIF service.

The method minimizes the latency in starting a Freeze mode interrupt handler
since its commencement is never disabled, unless by another Freeze mode
handler. The latency in starting a second stage handler is not reduced. Further restrictions
may have to be applied to first stage handlers to disallow access to resources which
must be atomically manipulated within critical code sections, such as linked-list
data structures.
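The enter and leave steps of such a critical section can be wrapped in two small routines, as sketched below for a host machine. This is an illustration under stated assumptions: critical_enter(), critical_leave() and sendsig_stub() are hypothetical names, the variables are local stand-ins for the library variables, and sendsig_stub() merely records the request that the real code would pass to _sendsig().

```c
static int sig_stack[200];                    /* stand-in for the signal stack */
static int *volatile sig_stack_p = sig_stack; /* advanced by first stage handlers */
static volatile int dispatcher_running;       /* 0 = FALSE, -1 = TRUE */
static int dispatcher_sig = 7;

static int sendsig_calls;                     /* bookkeeping for the stub */
static int last_sig;

/* Stand-in for the _sendsig() HIF service: records the request only. */
static void sendsig_stub(int sig)
{
    ++sendsig_calls;
    last_sig = sig;
}

/* Pretend the Dispatcher is already active so that queue_sig only
 * queues signals while the critical section runs. */
static void critical_enter(void)
{
    dispatcher_running = -1;
}

/* Reenable the Dispatcher and start it if signals were queued meanwhile. */
static void critical_leave(void)
{
    dispatcher_running = 0;
    if (sig_stack_p != sig_stack)
        sendsig_stub(dispatcher_sig);
}
```

Keeping the test of the signal stack inside critical_leave() removes the risk of a caller forgetting it, in which case queued signals would wait until the next interrupt arrived.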

2.5.8  Signal  Processing  Without  a  HIF  Operating  System 

A signal processing technique is recommended for dealing with complex C level
interrupt handlers. The previous sections have described in detail how signal
processing can be performed. AMD and other tool providers supply the necessary
support code which has been well tested and is known to be reliable. However, some
developers may select an operating system which does not support the HIF services
required by the previous example code. Additionally, many embedded systems are
dependent on simple home-made boot-up code, which provides few support services.

A commercial operating system will implement its own interrupt processing
services. It is likely these services will be somewhat based on the signal processing
code described in this book. However, the provided services should be used in
preference to the HIF services. In fact, the chosen operating system may not provide any
support for HIF services.

When building simple boot-up and run-time support code for a small
embedded system, it is best to provide the necessary HIF services required for signal
processing. If the boot-up code is based on AMD's OS-boot product, then all HIF
services will be provided. If OS-boot is not used, it is important that limited HIF
support be included in the developed code. Only the signal, settrap, sysalloc and
sigret-type subset of HIF services is required. A trap handler for HIF trap number
69 should be installed, together with the code required to process the HIF service
requests. Very little code is required and example code can be taken from OS-boot.

2.5.9  An  Example  Am29200  Interrupt  Handler 

The following example makes use of the code presented in the previous sections
of this chapter. The Programmable I/O (PIO) port of an Am29200 microcontroller is
configured such that PIO signal-pin PIO0 is an output, and PIO signal-pin PIO15 an
input. The system hardware ensures that the two pins are wired together. A two stage
interrupt handler is assigned to processing interrupts generated by a rising edge on
pin PIO15. By first clearing pin PIO0 and then setting it to one, an interrupt will be
generated.

First,  a  number  of  include  files  must  be  accessed  to  declare  external  data  and 
procedure  type  information.  Newer  versions  of  file  signal.h  contain  the  extern 
declarations  listed  below.  Hence,  only  when  using  an  older  signal.h  file  need  the 
extern  statement  be  explicitly  included. 
    #include <hif.h>
    #include <signal.h>

    extern int interrupt(int, _Interrupt (*C_handler)(int));
    extern int sig_sig;
    extern int dispatcher_sig;
    extern void intr3(void);
    extern void _enable(void);
    extern void _disable(void);
    extern void enable(int);
    extern int disable(int);
    extern void sig_dispatcher(int);

It is best to access the Programmable I/O port via support macros or procedures.
Macros have a speed advantage (unless in-line procedures are used), and below are a
number of macros and support data structures which simplify control of the PIO port.

    typedef volatile struct PIO_str         /* PIO class */
    {
        unsigned int poct;
        unsigned int pin;
        unsigned int pout;
        unsigned int poen;
    } PIO_t;

    PIO_t *PIO_p = (PIO_t*)0x800000d0;      /* PIO object */
                                            /* ICR pntr. */
    volatile unsigned int* ICR_p = (unsigned int*)0x80000028;

    #define PIO_enable_m(port)   PIO_p->poen |= (1 << (port))
    #define PIO_disable_m(port)  PIO_p->poen &= ~(1 << (port))
    #define PIO_rising_m(port) \
            PIO_p->poct |= (0x2 << (2*(port))); \
            PIO_p->poct &= ~(1 << (port));
    #define PIO_falling_m(port) \
            PIO_p->poct |= (0x2 << (2*(port))); \
            PIO_p->poct |= (1 << (port));
    #define PIO_high_m(port) \
            PIO_p->poct |= (0x1 << (2*(port))); \
            PIO_p->poct |= (1 << (port));
    #define PIO_out_m(port, val) \
            { unsigned int tmp = PIO_p->pout; \
              tmp &= ~(1 << (port)); \
              tmp |= (((val) & 1) << (port)); \
              PIO_p->pout = tmp; \
            }
    #define ICR_clear_m(vec)  *ICR_p |= (1 << (251-(vec)))
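The read-modify-write sequence inside PIO_out_m can be exercised without Am29200 hardware by applying the same arithmetic to an ordinary variable. The function below is a sketch for that purpose; the name set_port_bit() is hypothetical and it does not touch any device register.

```c
/* Return `reg` with bit number `port` replaced by the low bit of `val`.
 * All other bits of `reg` are left unchanged, mirroring the sequence
 * used by the PIO_out_m macro. */
static unsigned int set_port_bit(unsigned int reg, unsigned int port,
                                 unsigned int val)
{
    reg &= ~(1u << port);           /* clear the target bit */
    reg |= (val & 1u) << port;      /* insert the new value */
    return reg;
}
```

For example, set_port_bit(0xffff, 0, 0) yields 0xfffe and set_port_bit(0, 15, 1) yields 0x8000. Written as a function, the operation also avoids the usual hazards of multi-statement macros, at the cost of a procedure call unless the compiler places it in-line.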
Using the _Interrupt keyword, first and second stage interrupt handlers are
defined below for the PIO15 interrupt. No real work is performed by the example
second stage handler, but it does demonstrate how a full-C-context handler can be
reached. The second stage handler does not qualify as a Freeze mode interrupt
handler because it is not a leaf routine.

    int PIO15_sig;              /* signal number allocated to second stage */

    _Interrupt PIO15_handler()  /* first stage interrupt handler */
    {
        ICR_clear_m(228);       /* clear interrupt request */
        PIO_out_m(0, 0);        /* clear PIO0 port bit */
        sig_sig = 0x80000000 | PIO15_sig;   /* request second stage */
    }

    _Interrupt sig_PIO15(sig_number)        /* second stage handler */
    int sig_number;
    {
        printf("Running PIO15 signal handler\n");
    }

Before the interrupt mechanism can be put to work, the various support handlers
must be installed as shown below. The program is being developed with the
MiniMON29K DebugCore and this requires that the OS-boot support interrupt
handlers be preserved before the new interrupt handlers are added. The PIO support
macros are then used to establish the correct PIO port operation before an
interrupt is generated by forcing a 0-1 level transition on PIO pin PIO0.

    int main()
    {
        void (*V_minimon)();

        V_minimon = (void(*)())_settrap(19, intr3);     /* INTR3 */
        _settrap(220+24, V_minimon);    /* MiniMON support interrupts */
        _settrap(220+25, V_minimon);    /* see section 2.5.5 */
        _settrap(220+26, V_minimon);
        _settrap(218, _disable);        /* signal dispatcher support */
        _settrap(217, _enable);         /* see section 2.5.6 */
        dispatcher_sig = 7;             /* signal number for dispatcher */
        signal(dispatcher_sig, sig_dispatcher);

        /* application interrupt handlers for I/O port PIO15 */
        PIO15_sig = interrupt(228, sig_PIO15);  /* second stage */
        if (interrupt(228, PIO15_handler))      /* first stage */
            printf("ERROR installing Freeze mode handler\n");

        /* configure PIO port operation */
        PIO_p->poct = 0;                /* clear control register */
        PIO_enable_m(0);                /* enable PIO0 output */
        PIO_rising_m(15);               /* PIO15 edge sensitive */
        PIO_out_m(0, 0);                /* PIO0 = 0 */
        PIO_out_m(0, 1);                /* generate an interrupt */
    }

Users  of  the  High  C  29K  tool  chain  could  test  the  interrupt  handling  mechanism 
without  first  building  the  necessary  hardware  by  asserting  the  assigned  trap  number 
as  shown  below. 

    _ASM(" asneq 228,gr1,gr1");     /* test interrupt mechanism */


2.6  SUPPORT  UTILITY  PROGRAMS 

There are a number of important utility programs available to the software
developer. These tools are generally available on all development platforms and are
shared by different tool vendors. Most of the programs operate on object files
produced by the assembler or linker. All linkable object files and executable files are
maintained in AMD Common Object File Format (COFF). This standard is very
closely based on the AT&T standard used with UNIX System V. Readers wishing to
know more about the details of the format may consult the High C 29K
documentation or the AT&T Programmer's Guide for UNIX System V. The coff.h include file,
found on most tool distributions, describes the C language data structures used by the
COFF standard, often described as the COFF wrappers.

2.6.1 Examining Object Files (Type .o And a.out)

nm29 

The  nm29  utility  program  is  used  to  examine  the  symbol  table  contained  in  a 
binary  COFF  file  produced  by  the  compiler,  assembler  or  linker.  The  format  is 
very  much  like  the  UNIX  nm  utility.  Originally  nm29  was  written  to  supply 
symbol  table  information  to  the  munch29  utility  in  support  of  the  AT&T  C++ 
cfront  program.  A  number  of  command  line  options  have  been  added  to  enable 
additional  information  to  be  printed,  such  as  symbol  type  and  section  type. 

One useful way to use nm29 is to pipe the output to the sort utility, for example:
“nm29 a.out | sort | more”; each symbol is printed preceded by its value. The sort
utility arranges for symbol table entries to be presented in ascending value.
Since most symbols are associated with address labels, this is a useful way to
locate an address relative to its nearest address labels.
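When many addresses must be resolved, the same lookup can be automated once the sorted symbol values have been read into a table. The sketch below shows one way to find the nearest preceding label with a binary search. The structure and function names are hypothetical, and reading the nm29 output into the table is left to the reader.

```c
#include <stddef.h>

struct sym {
    unsigned long value;            /* symbol value (normally an address) */
    const char *name;               /* symbol name */
};

/* Return the last entry whose value is less than or equal to `addr`,
 * or NULL when `addr` lies below every symbol. The table `tab` of `n`
 * entries must already be sorted in ascending value, as produced by
 * piping the symbol listing through sort. */
static const struct sym *nearest_label(const struct sym *tab, size_t n,
                                       unsigned long addr)
{
    size_t lo = 0, hi = n;          /* candidates lie in [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].value <= addr)
            lo = mid + 1;           /* answer is at mid or to its right */
        else
            hi = mid;               /* answer is to the left of mid */
    }
    return lo ? &tab[lo - 1] : NULL;
}
```

The offset of the address from the returned label is simply addr minus the label's value, which is the form in which most debuggers report code addresses.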

munch29 

This  utility  is  used  with  the  AT&T  C++  preprocessor.  This  program  is  known  as 
cfront  and  converts  C++  programs  into  C.  After  the  C++  program  has  been 
converted  and  linked  with  other  modules  and  libraries,  it  is  examined  with 
nm29 to determine the names of any static constructor and destructor functions.
The  C++  translator  builds  these  functions  as  necessary  and  tags  their  names  with 
predefined  character  sequences.  The  output  from  nm29  is  passed  to  munch29 
which  looks  for  constructor  and  destructor  names.  If  found,  munch29  builds  C 
procedures  which  call  all  the  identified  object  constructors  and  destructors. 
Because  the  constructor  functions  must  execute  before  the  application  main() 
program,  the  original  program  is  relinked  with  the  constructor  procedures  being 
called  before  main().  The  main()  entry  is  replaced  with  _main().  This  also 
enables  the  call  to  destructor  procedures  to  be  made  in  _main()  when  main() 
returns. 

Because  G++  is  now  available  for  C++  code  development  (note,  G++  is 
incorporated  into  the  GCC  compiler),  there  is  little  use  being  made  of  the  AT&T 
cfront  preprocessor.  Additionally,  MRI  and  Metaware  are  expected  to  shortly 
have  commercial  C++  compilers  available. 

rdcoff 

The  rdcoff  utility  is  only  available  to  purchasers  of  the  High  C  29K  product. 
This  utility  prints  the  contents  of  a  COFF  conforming  object  file.  Each  COFF 
file  section  is  presented  in  an  appropriate  format.  For  example,  text  sections  are 
disassembled. If the symbol table has not been stripped from the COFF file, then
symbol  values  are  shown.  The  utility  is  useful  for  examining  COFF  header 
information,  such  as  the  text  and  data  region  start  addresses.  Those  using  GNU 
tools  can  use  the  coff  and  objdump  utilities  to  obtain  this  information. 

coff

This utility is a shorthand way of examining COFF files. It reports a summary of
COFF header information, followed by similar reports for each of the sections
found in the object file. The utility is useful for quickly checking the link
mapping of a.out type files, especially when a project is using a number of
different 29K target systems which have different memory system layouts,
requiring different program linkage.

objdump 

This  utility  is  supplied  with  the  GNU  tool  chain.  It  can  be  used  to  examine 
selected  parts  of  object  files.  It  has  an  array  of  command  line  options  which  are 
compatible  with  the  UNIX  System  V  utility  of  the  same  name.  In  a  similar  way 
to  the  rdcoff  utility  it  attempts  to  format  selected  information  in  a  meaningful 
way. 

swaf 

This utility is used to produce a General-Purpose ASCII (GPA) symbols file for
use with Hewlett-Packard's B3740A Software Analyzer tool. This tool enables
a 16500B card cage along with a selection of logic analyzer cards to support
high level software debugging. The swaf utility builds a GPA symbols file from
information extracted from a linked COFF file. When the GPA file is loaded into
the analyzer it is possible to display address values in symbol format rather than,
say, hex based integers. Via a remote computer, the HP16500B can be used to
support execution trace at source level.

mksym 

This  utility  is  required  to  build  symbol  table  information  for  the  UDB  debugger. 
The  UDB  debugger  does  not  directly  operate  with  COFF  symbol  information.  A 
mksym command is typically placed in a makefile; after the 29K program has
been linked a new symbol table file should be built.


2.6.2  Modifying  Object  Files 

cvcoff 

The  COFF  specification  states  that  object  file  information  is  maintained  in  the 
endian  of  the  host  processor.  This  need  not  be  the  endian  of  the  target  29K 
processor.  As  described  in  Chapter  1,  29K  processors  can  run  in  big-  or 
little-endian  but  are  almost  exclusively  used  in  big-endian  format.  Endian 
refers  to  which  byte  position  in  a  word  is  considered  the  byte  of  lowest  address. 
With  big-endian,  bytes  further  left  have  lower  addresses.  Machines  such  as 
VAXs  and  IBM-PCs  operate  with  little-endian;  and  machines  from  SUN  and 
HP  tend  to  operate  with  big-endian. 

What this means to the 29K software developer is that COFF files on, say, a PC
will have little-endian COFF wrappers, and COFF files on, say, a SUN
machine will have big-endian wrappers, regardless of the endianness of the 29K
target code. When object files or libraries containing object files are moved
between host machines of different endianness, the cvcoff utility must be used
to convert the endianness of the COFF wrapper information. The cvcoff utility
can also be used to check the endianness of an object file. Most utility programs
and software development tools expect to operate on object files which are in
host endian; however, there are a few tools which can operate on COFF files of
either host endianness. In practice this reduces the need to use the cvcoff utility.
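The effect of moving a file between hosts can be sketched as follows. This is an illustrative model only, not a description of how cvcoff is implemented; the function names are hypothetical, and a generic 16-bit header field (such as a section count) stands in for the real wrapper contents.

```python
import struct


def write_header_field(value, host_big_endian):
    """Encode a 16-bit header field in the byte order of the writing host."""
    return struct.pack(">H" if host_big_endian else "<H", value)


def read_header_field(raw, host_big_endian):
    """Decode a 16-bit header field in the byte order of the reading host."""
    return struct.unpack(">H" if host_big_endian else "<H", raw)[0]


def swap16(raw):
    """Reverse the two bytes of a 16-bit field, as a wrapper conversion must."""
    return raw[::-1]


if __name__ == "__main__":
    raw = write_header_field(3, host_big_endian=False)   # written on a PC
    print(read_header_field(raw, host_big_endian=True))  # misread on a SUN
    print(read_header_field(swap16(raw), host_big_endian=True))
```

A field written as 3 on a little-endian host is misread as 0x0300 on a big-endian host until its bytes are swapped. The 29K target code inside the file is untouched by such a conversion.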

strpcoff 

This  utility  can  be  used  to  remove  unnecessary  information  from  a  COFF  file. 
When  programs  are  compiled  with  the  “-g”  option,  additional  symbol 
information  is  added  to  the  COFF  file.  The  strpcoff  utility  can  be  used  to 
remove  this  information  and  any  other  details  such  as  relocation  data  and 
line-number pointers. Typically linkers have an option to automatically strip
this information after linking. (ld29 has the “-s” option.) The COFF file header
information needed for loading a program is not stripped.
2.6.3  Getting  a  Program  into  ROM

After a program has been finally linked, and possibly adjusted to deal with any
data initialization problems (see section 2.3.6), it must be transferred into ROM
devices. This is part of the typical software development cycle for embedded processor
products. A number of manufacturers make equipment for programming PROM
devices. They normally operate with data files which must be appropriately formatted.
Tektronix Hex format and Motorola S3 Records are two of the commonly used file
formats. The coff2hex utility can be used to convert the COFF formatted executable
file produced by the linker into a new file which is correctly formatted for the selected
PROM programmer. If more than one PROM is required to store the program,
coff2hex can be instructed to divide the COFF data into a set of appropriate files.
Alternatively, this task can be left to more sophisticated programming equipment. The
utility has a number of command line options: the width and size of PROM devices
can be chosen, or specific products can be selected by manufacturer part
number.
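To make the file formatting step concrete, the sketch below builds a Motorola S3 record and divides a memory image among byte-wide PROMs. It is an illustration only and does not reproduce coff2hex or its options; the function names are hypothetical, and which byte lane maps to which PROM socket depends on the target board.

```python
def s3_record(address, data):
    """Format one Motorola S3 record (32-bit address) for the given data bytes."""
    count = 4 + len(data) + 1            # address bytes + data bytes + checksum
    body = bytes([count]) + address.to_bytes(4, "big") + bytes(data)
    checksum = ~sum(body) & 0xFF         # one's complement of the summed bytes
    return "S3" + body.hex().upper() + f"{checksum:02X}"


def split_byte_lanes(image, lanes=4):
    """Divide a memory image among `lanes` byte-wide PROMs, one lane each."""
    return [image[lane::lanes] for lane in range(lanes)]


if __name__ == "__main__":
    image = bytes(range(8))              # stand-in for linked program data
    for lane, part in enumerate(split_byte_lanes(image)):
        print(lane, s3_record(0, part))
```

Each record carries a byte count, a 32-bit load address, the data, and a checksum that lets the PROM programmer detect a corrupted line.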
Chapter  3


Assembly  Language  Programming 


Most  developers  of  software  for  the  29K  family  will  use  a  high  level  language, 
such  as  C,  for  the  majority  of  code  development.  This  makes  sense  for  a  number  of 
reasons:  Using  a  high  level  language  enables  a  different  processor  to  be  selected  at 
some  future  date.  The  code,  if  written  in  a  portable  way,  need  only  be  recompiled  for 
the  new  target  processor.  The  ever  increasing  size  of  embedded  software  projects 
makes  the  higher  productivity  achievable  with  a  high  level  language  attractive.  And 
additionally,  the  29K  family  has  a  RISC  instruction  set  which  can  be  efficiently  used 
by  a  high  level  language  compiler  [Mann  et  al  1991b]. 

However, the software developer must resort to assembly code programming
in a number of special cases. Because of the relentless efficiency of the
current C language compilers for the 29K, it is difficult for a programmer to out-perform
the code generating abilities of a compiler for any reasonably sized program.
For this reason it is best to limit the use of assembly code as much as possible. Some
of the support tasks which do require assembly coding are:

■  Low-level  support  routines  for  interrupts  and  traps  (see  Chapter  4). 

■  Operating  system  support  services  such  as  system  calls  and  application-task 
context  switching  (see  Chapter  5).  Also,  taking  control  of  the  processor  during 
the  power-up  and  initialization  sequence. 

■  Memory Management Unit trapware (see Chapter 6).

■  Floating-point  and  complex  integer  operation  trapware,  where  the  29K  family 
member  does  not  support  the  operation  directly  in  hardware. 

■  High performance versions of critical routines. In some cases it may be possible
to enhance a routine's performance by implementing assembly code short-cuts
not identified by a compiler.
This chapter deals with aspects of assembly level programming. There are some
differences between 29K family members, particularly in the area of on-chip peripherals
for microcontrollers. The chapter does not go into details peculiar to individual
family members; for that it is best to study the processor User’s Manual.

The  material  covered  is  relevant  to  all  29K  family  members. 


3.1  INSTRUCTION  SET 

The Am29000 microprocessor implements 112 instructions. All hardware
implemented instructions execute in a single cycle, except for IRET, IRETINV,
LOADM and STOREM. Instruction format was discussed in section 1.11. All
instructions have a fixed 32-bit format, with an 8-bit opcode field and three 8-bit
operand fields. Field-C specifies the result operand register (DEST); field-A and field-B
supply the source operands (SRCA and SRCB). Most instructions operate on data
held in global or local registers, and there are no complex addressing modes
supported. Field-B, or field-B and field-A combined, can be used to provide 8-bit or
16-bit immediate data for instructions. Access to external memory can only be
performed with the LOAD[M] and STORE[M] instructions. There are a number of
instructions, mostly used by operating system code, for accessing the processor
special registers.
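The fixed format makes instruction decoding simple, as the following sketch suggests. It assumes the field order opcode, field-C, field-A, field-B from the most significant byte downward (the ordering implied by Figure 3-2), and that the low opcode bit M distinguishes a register from an immediate in field-B. The function name and the sample bit pattern are arbitrary.

```python
def decode(word):
    """Split a 32-bit instruction word into its four 8-bit fields."""
    opcode = (word >> 24) & 0xFF
    return {
        "opcode": opcode,
        "m_bit": opcode & 1,              # assumed: 1 means field-B is immediate
        "field_c": (word >> 16) & 0xFF,   # DEST
        "field_a": (word >> 8) & 0xFF,    # SRCA
        "field_b": word & 0xFF,           # SRCB register number or 8-bit data
    }


if __name__ == "__main__":
    for name, value in decode(0x15806102).items():
        print(f"{name:8s} 0x{value:02X}")
```

Because every field sits at a fixed position, the hardware can begin fetching register operands before the opcode has been fully decoded.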

The following sections deal with the different instruction classes. Some of the
instructions described are not directly supported by all members of the 29K family.
In particular, many of the floating-point instructions are only directly executed by
the Am29050 processor. If an instruction is not directly supported by the processor
hardware, then a trap is generated during instruction execution. In this case, the
operating system uses trapware to implement the instruction’s operation in software.
Emulating nonimplemented instructions in software means some instruction
execution speeds are reduced, but the instruction set is compatible across all family
members.

3.1.1  Integer  Arithmetic 

The Integer Arithmetic instructions perform add, subtract, multiply, and divide
operations on word-length (32-bit) integers. All instructions in this class set the
ALU Status Register. The integer arithmetic instructions are shown in Tables 3-1 and
3-2.

The MULTIPLU, MULTIPLY, DIVIDE, and DIVIDU instructions are not
implemented directly on most 29K family members, but are supported by traps. To
determine if your processor directly supports these instructions, check with the
processor User’s Manual or the tables in Chapter 1. The Am29050 microprocessor supports
the multiply instructions directly but not the divide instructions.
Table 3-1. Integer Arithmetic Instructions

Mnemonic    Operation Description

ADD         DEST <- SRCA + SRCB
ADDS        DEST <- SRCA + SRCB
            IF signed overflow THEN Trap (Out Of Range)
ADDU        DEST <- SRCA + SRCB
            IF unsigned overflow THEN Trap (Out Of Range)
ADDC        DEST <- SRCA + SRCB + C (from ALU)
ADDCS       DEST <- SRCA + SRCB + C (from ALU)
            IF signed overflow THEN Trap (Out Of Range)
ADDCU       DEST <- SRCA + SRCB + C (from ALU)
            IF unsigned overflow THEN Trap (Out Of Range)
SUB         DEST <- SRCA - SRCB
SUBS        DEST <- SRCA - SRCB
            IF signed overflow THEN Trap (Out Of Range)
SUBU        DEST <- SRCA - SRCB
            IF unsigned underflow THEN Trap (Out Of Range)
SUBC        DEST <- SRCA - SRCB - 1 + C (from ALU)
SUBCS       DEST <- SRCA - SRCB - 1 + C (from ALU)
            IF signed overflow THEN Trap (Out Of Range)
SUBCU       DEST <- SRCA - SRCB - 1 + C (from ALU)
            IF unsigned underflow THEN Trap (Out Of Range)
SUBR        DEST <- SRCB - SRCA
SUBRS       DEST <- SRCB - SRCA
            IF signed overflow THEN Trap (Out Of Range)
SUBRU       DEST <- SRCB - SRCA
            IF unsigned underflow THEN Trap (Out Of Range)
SUBRC       DEST <- SRCB - SRCA - 1 + C (from ALU)

(continued)
Table 3-2. Integer Arithmetic Instructions (Concluded)

Mnemonic    Operation Description

SUBRCS      DEST <- SRCB - SRCA - 1 + C (from ALU)
            IF signed overflow THEN Trap (Out Of Range)
SUBRCU      DEST <- SRCB - SRCA - 1 + C (from ALU)
            IF unsigned underflow THEN Trap (Out Of Range)
MULTIPLU    Q//DEST <- SRCA * SRCB (unsigned)
MULTIPLY    Q//DEST <- SRCA * SRCB (signed)
MUL         Perform one-bit step of a multiply operation (signed)
MULL        Complete a sequence of multiply steps
MULU        Perform one-bit step of a multiply operation (unsigned)
DIVIDE      DEST <- (Q//SRCA)/SRCB (signed)
            Q <- Remainder
DIVIDU      DEST <- (Q//SRCA)/SRCB (unsigned)
            Q <- Remainder
DIV0        Initialize for a sequence of divide steps (unsigned)
DIV         Perform one-bit step of a divide operation (unsigned)
DIVL        Complete a sequence of divide steps (unsigned)
DIVREM      Generate remainder for divide operation (unsigned)
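The carry forms in Tables 3-1 and 3-2 exist to support multi-precision arithmetic. The sketch below models a 64-bit add and subtract built from 32-bit steps. It assumes that after a subtract the carry bit C is 1 when no borrow occurred, which is the convention implied by the SUBC formula (SRCA - SRCB - 1 + C); the function names are hypothetical.

```python
MASK = 0xFFFFFFFF


def add(a, b, c=0):
    """Model ADD (c=0) or ADDC: return (32-bit result, carry out)."""
    total = (a & MASK) + (b & MASK) + c
    return total & MASK, total >> 32


def sub(a, b, c=1):
    """Model SUB (c=1) or SUBC: a - b - 1 + c, computed as a + ~b + c."""
    total = (a & MASK) + (~b & MASK) + c
    return total & MASK, total >> 32


def add64(a_hi, a_lo, b_hi, b_lo):
    lo, carry = add(a_lo, b_lo)          # ADD on the low words
    hi, _ = add(a_hi, b_hi, carry)       # ADDC on the high words
    return hi, lo


def sub64(a_hi, a_lo, b_hi, b_lo):
    lo, carry = sub(a_lo, b_lo)          # SUB on the low words
    hi, _ = sub(a_hi, b_hi, carry)       # SUBC on the high words
    return hi, lo
```

The low-order words are processed first, and the carry they produce feeds the high-order step.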

3.1.2  Compare 

The  Compare  instructions  test  for  various  relationships  between  two  values. 
For  all  Compare  instructions  except  the  CPBYTE  instruction,  the  comparisons  are 
performed  on  word-length  signed  or  unsigned  integers.  There  are  two  types  of 
compare  instruction.  The  first  writes  a  Boolean  value  into  the  result  register  (selected 
by  the  instruction  DEST  operand)  depending  on  the  result  of  the  comparison.  A 
Boolean  TRUE  value  is  represented  by  a  1  in  the  most  significant  bit  position.  A 
Boolean  FALSE  is  defined  as  a  0  in  the  most  significant  bit.  The  29K  uses  a  global  or 
local  register  to  contain  the  comparison  result  rather  than  the  ALU  status  register. 
This  offers  a  performance  advantage  as  there  is  less  conflict  over  access  to  a  single 
shared  resource.  Compare  instructions  are  frequently  followed  by  conditional  Jump
or  Call  instructions  which  depend  on  the  contents  of  the  compare  result  register. 
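The Boolean convention can be modelled in a few lines. In this sketch TRUE is written as 0x80000000 and FALSE as 0, which assumes the bits below bit 31 are written as zero; the Python function names simply mirror the mnemonics.

```python
MASK = 0xFFFFFFFF
TRUE = 0x80000000    # bit 31 set
FALSE = 0x00000000   # bit 31 clear


def to_signed(value):
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= MASK
    return value - (1 << 32) if value & 0x80000000 else value


def cplt(srca, srcb):
    """Signed less-than compare, result in Boolean register format."""
    return TRUE if to_signed(srca) < to_signed(srcb) else FALSE


def cpltu(srca, srcb):
    """Unsigned less-than compare, result in Boolean register format."""
    return TRUE if (srca & MASK) < (srcb & MASK) else FALSE


def is_true(register):
    """A conditional jump examines only the most significant bit."""
    return bool(register & 0x80000000)
```

The same bit pattern, 0xFFFFFFFF, is less than 1 when treated as signed (-1) but greater than 1 when treated as unsigned, which is why both forms of each comparison are provided.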

The second type of compare instruction incorporates a conditional test in the
same instruction cycle that accomplishes the comparison. These instructions,
known as Assert instructions, allow instruction execution to continue only if the
result of the comparison is TRUE. Otherwise a trap to operating system code is taken.
The trap number is supplied in the field-C (DEST) operand position of the
instruction. Trap numbers 0 to 63 are reserved for Supervisor mode program use. If an
Assert instruction with a trap number less than 64 is attempted while the processor is
operating in User mode, a protection violation trap will be taken. Note that this will occur
even if the assertion would have been TRUE. Assert instructions are used in
procedure prologue and epilogue routines to perform register stack bounds checking (see
Chapter 2). Their fast operation makes them ideal for reducing the overhead of
register stack support. They are also used as a means of requesting an operating system
support service (system call). In this case a condition known to be FALSE is asserted,
and the trap number for the system call is supplied in instruction field-C. The
Compare instructions are shown in Tables 3-3 and 3-4.
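The rules governing Assert instructions can be summarized in a small model. This is a sketch of the behavior described above, not of any operating system; the exception classes and function names are hypothetical.

```python
class Trap(Exception):
    """Raised when an assertion fails; carries the field-C trap number."""

    def __init__(self, number):
        super().__init__(f"trap {number}")
        self.number = number


class ProtectionViolation(Exception):
    """Raised when User mode code names a Supervisor-only trap number."""


def assert_instruction(condition, trap_number, supervisor_mode):
    # The privilege check comes first: it applies even if the assertion holds.
    if trap_number < 64 and not supervisor_mode:
        raise ProtectionViolation(f"trap {trap_number} is reserved")
    if not condition:
        raise Trap(trap_number)


def system_call(trap_number):
    """Request a service by asserting a condition known to be FALSE."""
    assert_instruction(False, trap_number, supervisor_mode=False)
```

When the asserted condition holds and the trap number is permitted, execution simply continues, which is why bounds checks built from Assert instructions cost so little in the common case.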

The CPBYTE instruction performs four comparisons simultaneously. The four bytes in the
SRCA operand are compared with the corresponding bytes of the SRCB operand, and if any of
them match then Boolean TRUE is placed in the DEST register. The instruction can be very
efficiently used when scanning character strings. In particular, the C programming language
marks the end of character strings with a 0 value. Using the CPBYTE instruction with
SRCB supplying an immediate value 0, the string length can be quickly determined.
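The string scanning technique can be sketched as follows. The model assumes big-endian word loads from a buffer whose length is a multiple of four bytes; the function names are hypothetical.

```python
TRUE = 0x80000000
FALSE = 0x00000000


def cpbyte(srca, srcb):
    """TRUE if any byte of srca equals the corresponding byte of srcb."""
    for shift in (24, 16, 8, 0):
        if (srca >> shift) & 0xFF == (srcb >> shift) & 0xFF:
            return TRUE
    return FALSE


def strlen_by_words(memory):
    """Find the length of a 0-terminated string, testing four bytes at a time."""
    if len(memory) % 4:
        raise ValueError("buffer length must be a multiple of 4")
    for offset in range(0, len(memory), 4):
        chunk = memory[offset:offset + 4]
        word = int.from_bytes(chunk, "big")
        if cpbyte(word, 0) == TRUE:           # some byte in this word is 0
            return offset + chunk.index(0)    # locate it within the word
    raise ValueError("no terminator found")
```

Only the word that contains the terminator needs byte-by-byte examination; every other word is dismissed with a single comparison.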

3.1.3  Logical 

The  Logical  instructions  perform  a  set  of  bit-by-bit  Boolean  functions  on 
word-length  bit  strings.  All  instructions  in  this  class  set  the  ALU  Status  Register. 
These  instructions  are  shown  in  Table  3-5. 

3.1.4  Shift 

The Shift instructions (Table 3-6) perform arithmetic and logical shifts on global
and local register data. The one exception is the EXTRACT instruction, which
operates on double-word data. When EXTRACT is used, the SRCA and SRCB operand
registers are concatenated to form a 64-bit data value. This value is then shifted by
the funnel shifter by the amount specified by the Funnel Shift Count register (FC).
The high-order 32 bits of the shifted result are placed in the DEST register. The funnel
shifter can be used to perform barrel shift and rotate operations in a single cycle.
Note that when the SRCA and SRCB operands are the same register, the 32-bit operand
is effectively rotated. The result may be written back to the same register or placed in
a different global or local register (see Figure 3-1). The funnel shifter is useful for
fixing-up unaligned memory accesses. The two memory words holding the
unaligned data can be loaded into global registers, and then aligned by the EXTRACT
instruction into the destination register. A code example showing the rotate operation
of the funnel shifter is given below:

        mtsrim   fc, 8                ; rotate 8 bits left
        extract  gr96, gr97, gr97     ; source in gr97


Table 3-3. Compare Instructions

Mnemonic    Operation Description

CPEQ        IF SRCA = SRCB THEN DEST <- TRUE
            ELSE DEST <- FALSE
CPNEQ       IF SRCA <> SRCB THEN DEST <- TRUE
            ELSE DEST <- FALSE
CPLT        IF SRCA < SRCB THEN DEST <- TRUE
            ELSE DEST <- FALSE
CPLTU       IF SRCA < SRCB (unsigned) THEN DEST <- TRUE
            ELSE DEST <- FALSE
CPLE        IF SRCA <= SRCB THEN DEST <- TRUE
            ELSE DEST <- FALSE
CPLEU       IF SRCA <= SRCB (unsigned) THEN DEST <- TRUE
            ELSE DEST <- FALSE
CPGT        IF SRCA > SRCB THEN DEST <- TRUE
            ELSE DEST <- FALSE
CPGTU       IF SRCA > SRCB (unsigned) THEN DEST <- TRUE
            ELSE DEST <- FALSE

(continued)


[Figure 3-1 diagram: the SRCA and SRCB operands are concatenated and shifted by the
funnel shift count, FC; the result is written to the DEST register.]

Figure 3-1. The EXTRACT Instruction uses the Funnel Shifter
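Both uses of the funnel shifter, rotation and unaligned access fix-up, can be checked with a short model of EXTRACT. The sketch assumes a 5-bit shift count and big-endian byte ordering; the function names are hypothetical.

```python
MASK = 0xFFFFFFFF


def extract(srca, srcb, fc):
    """High-order word of the 64-bit value SRCA//SRCB shifted left by fc bits."""
    fc &= 31                                   # assumed: FC holds a 5-bit count
    combined = ((srca & MASK) << 32) | (srcb & MASK)
    return ((combined << fc) >> 32) & MASK


def rotate_left(value, bits):
    """Rotation results when both source operands are the same register."""
    return extract(value, value, bits)


def unaligned_word(word0, word1, byte_offset):
    """Align a word that starts byte_offset bytes into word0 (big-endian)."""
    return extract(word0, word1, 8 * byte_offset)
```

For memory holding the bytes 00 11 22 33 44 55 66 77, the word beginning at byte offset 1 is obtained by loading both aligned words and extracting with a shift count of 8.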
Table 3-4. Compare Instructions (Concluded)

Mnemonic    Operation Description

CPGE        IF SRCA >= SRCB THEN DEST <- TRUE
            ELSE DEST <- FALSE
CPGEU       IF SRCA >= SRCB (unsigned) THEN DEST <- TRUE
            ELSE DEST <- FALSE
CPBYTE      IF (SRCA.BYTE0 = SRCB.BYTE0) OR
               (SRCA.BYTE1 = SRCB.BYTE1) OR
               (SRCA.BYTE2 = SRCB.BYTE2) OR
               (SRCA.BYTE3 = SRCB.BYTE3) THEN DEST <- TRUE
            ELSE DEST <- FALSE
ASEQ        IF SRCA = SRCB THEN Continue
            ELSE Trap (Vector Number - in field-C)
ASNEQ       IF SRCA <> SRCB THEN Continue
            ELSE Trap (Vector Number - in field-C)
ASLT        IF SRCA < SRCB THEN Continue
            ELSE Trap (Vector Number - in field-C)
ASLTU       IF SRCA < SRCB (unsigned) THEN Continue
            ELSE Trap (Vector Number - in field-C)
ASLE        IF SRCA <= SRCB THEN Continue
            ELSE Trap (Vector Number - in field-C)
ASLEU       IF SRCA <= SRCB (unsigned) THEN Continue
            ELSE Trap (Vector Number - in field-C)
ASGT        IF SRCA > SRCB THEN Continue
            ELSE Trap (Vector Number - in field-C)
ASGTU       IF SRCA > SRCB (unsigned) THEN Continue
            ELSE Trap (Vector Number - in field-C)
ASGE        IF SRCA >= SRCB THEN Continue
            ELSE Trap (Vector Number - in field-C)
ASGEU       IF SRCA >= SRCB (unsigned) THEN Continue
            ELSE Trap (Vector Number - in field-C)
Table 3-5. Logical Instructions

Mnemonic    Operation Description

AND         DEST <- SRCA & SRCB
ANDN        DEST <- SRCA & ~ SRCB
NAND        DEST <- ~ (SRCA & SRCB)
OR          DEST <- SRCA | SRCB
NOR         DEST <- ~ (SRCA | SRCB)
XOR         DEST <- SRCA ^ SRCB
XNOR        DEST <- ~ (SRCA ^ SRCB)

Table 3-6. Shift Instructions

Mnemonic    Operation Description

SLL         DEST <- SRCA << SRCB (zero fill)
SRL         DEST <- SRCA >> SRCB (zero fill)
SRA         DEST <- SRCA >> SRCB (sign fill)
EXTRACT     DEST <- high-order word of (SRCA//SRCB << FC)
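The difference between zero fill and sign fill in Table 3-6 is easiest to see with a negative operand. The sketch below assumes that only the low five bits of the shift count are used; the Python function names mirror the mnemonics.

```python
MASK = 0xFFFFFFFF


def sll(srca, count):
    """Shift left logical: vacated low-order bits are zero filled."""
    return (srca << (count & 31)) & MASK


def srl(srca, count):
    """Shift right logical: vacated high-order bits are zero filled."""
    return (srca & MASK) >> (count & 31)


def sra(srca, count):
    """Shift right arithmetic: vacated high-order bits copy the sign bit."""
    value = srca & MASK
    if value & 0x80000000:
        value -= 1 << 32                 # reinterpret as a negative integer
    return (value >> (count & 31)) & MASK
```

SRA therefore divides a signed value by a power of two (rounding toward negative infinity), while SRL does the same for an unsigned value.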

3.1.5  Data  Movement 

The Data Movement instructions (Tables 3-7 and 3-8) move bytes, half-words,
and words between processor registers. In addition, the LOAD[M] and STORE[M]
instructions move data between general-purpose registers and external devices,
memories, or coprocessors. The Am29050 processor has two additional instructions not
shown in Table 3-7. They are MFACC and MTACC, and they are used to access the four
double-word floating-point accumulators (see section 3.3.5).
Table 3-7. Data Move Instructions

Mnemonic    Operation Description

LOAD        DEST <- EXTERNAL WORD [SRCB]
LOADL       DEST <- EXTERNAL WORD [SRCB]
            assert *LOCK output during access
LOADSET     DEST <- EXTERNAL WORD [SRCB]
            EXTERNAL WORD [SRCB] <- h’FFFFFFFF’,
            assert *LOCK output during access
LOADM       DEST .. DEST + COUNT <-
            EXTERNAL WORD [SRCB] ..
            EXTERNAL WORD [SRCB + COUNT * 4]
STORE       EXTERNAL WORD [SRCB] <- SRCA
STOREL      EXTERNAL WORD [SRCB] <- SRCA
            assert *LOCK output during access
STOREM      EXTERNAL WORD [SRCB] ..
            EXTERNAL WORD [SRCB + COUNT * 4] <-
            SRCA .. SRCA + COUNT
EXBYTE      DEST <- SRCB, with low-order byte replaced
            by byte in SRCA selected by BP
EXHW        DEST <- SRCB, with low-order half-word replaced
            by half-word in SRCA selected by BP
EXHWS       DEST <- half-word in SRCA selected by BP,
            sign-extended to 32 bits
INBYTE      DEST <- SRCA, with byte selected by BP replaced
            by low-order byte of SRCB
INHW        DEST <- SRCA, with half-word selected by BP replaced
            by low-order half-word of SRCB
MFSR        DEST <- SPECIAL
MFTLB       DEST <- TLB [SRCA]
MTSR        SPDEST <- SRCB

(continued)
Table 3-8. Data Move Instructions (Concluded)

Mnemonic    Operation Description

MTSRIM      SPDEST <- 0I16 (16-bit data formed with SRCA and SRCB fields)
MTTLB       TLB [SRCA] <- SRCB

The LOAD and STORE instructions are most interesting (see Figure 3-2 for the
instruction format). Instruction field-C is assigned a number of bit-field tasks which
control the external access operation. Bit CE, when set, indicates that the data
transfer is to coprocessor space. AMD makes a floating-point coprocessor, Am29027,
which was frequently used with the Am29000 processor before the Am29050
processor became available. Because the Am29050 directly supports floating-point
instructions there are no new designs making use of the Am29027 coprocessor.


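[Figure 3-2 diagram: 32-bit instruction word with field boundaries at bits 31, 23 and 15.
The opcode field is XXXXXXXM; field-C holds the CE, AS, PA, SB and UA bits and the OPT
field; these are followed by the RA field and the RB or I field.]

Figure 3-2. LOAD and STORE Instruction Format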


Bit  field  AS  when  set  is  used  to  indicate  that  the  access  is  to  Input/Output  (I/O) 
space.  I/O  space  is  little  used  as  there  is  no  convenient  means  of  accessing  it  from  a 
high  level  language  such  as  C.  For  this  reason  peripheral  devices  are  typically 
mapped  into  external  data  memory  space  rather  than  I/O  space. 

The PA and UA bits are used by Supervisor mode code; PA is used by operating
systems which run with address translation turned on, but need to access an
external memory physical address. When bit PA is set, address translation is turned off for
the LOAD or STORE data access. This is useful when accessing peripheral devices.
When operating system code wishes to access a User’s code space, it sets the UA bit.
This causes the data transfer operation to execute with User rather than Supervisor
permission. If the User mode program was running with address translation on, then
the PID field of the MMU register is used when checking TLB access permissions.
Normally Supervisor mode code operates with a fixed PID value of zero.
The original versions of the Am29000 processor (rev-A to rev-B) did not
support byte sized access to external memory. For this reason bytes and half-words had
to be extracted from words after they had been read from memory; the Extract Byte
(EXBYTE) and Extract half-word (EXHW) instructions are supported by the
processor for just this purpose. Additionally, when data objects smaller than a word were
written to external memory, a read-modify-write process had to be used. The Insert
Byte (INBYTE) and Insert half-word (INHW) instructions supported the process.
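The read-modify-write process can be sketched with models of EXBYTE and INBYTE, following the operation descriptions in Table 3-7. The sketch assumes big-endian byte numbering, in which a Byte Pointer (BP) value of 0 selects the leftmost byte; memory is modelled as a list of words and the helper names are hypothetical.

```python
MASK = 0xFFFFFFFF


def _shift(bp):
    """Bit position of the byte selected by BP (big-endian: BP 0 is leftmost)."""
    return 8 * (3 - (bp & 3))


def exbyte(srca, srcb, bp):
    """SRCB with its low-order byte replaced by the BP-selected byte of SRCA."""
    return (srcb & 0xFFFFFF00) | ((srca >> _shift(bp)) & 0xFF)


def inbyte(srca, srcb, bp):
    """SRCA with its BP-selected byte replaced by the low-order byte of SRCB."""
    shift = _shift(bp)
    return (srca & ~(0xFF << shift) & MASK) | ((srcb & 0xFF) << shift)


def load_byte(memory, address):
    """Word read followed by a byte extract."""
    return exbyte(memory[address >> 2], 0, address & 3)


def store_byte(memory, address, value):
    """Read the word, insert the byte, write the word back."""
    index = address >> 2
    memory[index] = inbyte(memory[index], value, address & 3)
```

A byte store therefore costs a load, an insert and a store on the original processor revisions, which is the overhead removed by direct byte access support.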

Rev-C and later versions of the Am29000 processor and all other 29K family
members directly support byte and half-word accesses to memory. The instructions
described above need no longer be used. To enable current versions of the Am29000
processor to be compatible with the original processor, the DW bit was added to the
processor configuration register (CFG). When the DW bit is clear the processor
performs rev-A type memory accesses. All new designs operate with the DW bit set, and
other 29K family members operate with an implied DW bit set.

The OPT field bits specify the size of the data object being moved. They are also
used to indicate that a word sized access to Instruction ROM space is requested. External
logic must be incorporated in a memory system design if this option is to be
supported. The OPT field appears on the OPT(2:0) output pins during the memory
access. It is important that the object size is consistent with the address boundaries
defined by the lower bits of the memory address. For example, if a word sized access
(OPT field value is 0) is attempted with lower address bits aligned to a byte boundary
(A[1:0] not equal to 0) then an unaligned access trap may occur. The Unaligned Access
(UA) bit of the Current Processor Status (CPS) register must be set for the trap to be
taken. Additionally, alignment checking is only performed for instruction and data
memory, not for I/O or coprocessor space accesses.
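The alignment rule can be expressed as a small check. The text gives 0 as the OPT value for a word access; the values used below for byte (1) and half-word (2) accesses are assumptions that should be confirmed against the processor User’s Manual, and the function names are hypothetical.

```python
OPT_WORD = 0         # stated in the text
OPT_BYTE = 1         # assumed encoding
OPT_HALF_WORD = 2    # assumed encoding

# Address bits that must be zero for each access size.
ALIGN_MASK = {OPT_WORD: 0x3, OPT_BYTE: 0x0, OPT_HALF_WORD: 0x1}


def is_aligned(opt, address):
    """True when the low address bits are consistent with the object size."""
    return (address & ALIGN_MASK[opt]) == 0


def takes_unaligned_trap(opt, address, cps_ua_set, memory_space=True):
    """Trap only if checking is enabled and the access is to memory space."""
    return cps_ua_set and memory_space and not is_aligned(opt, address)
```

With the UA bit of the CPS register clear, a misaligned access proceeds without a trap, so code that relies on alignment checking must ensure the bit is set.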

The SB bit is used when reading bytes or half-words from external memory.
Sub-word sized accesses are determined by the OPT field; the processor right-justifies
the accessed data within the destination register. The SB bit, when set, causes the
remainder of the destination to be sign extended with the sign of the loaded data
object. When SB is clear, the destination register value is zero-extended. The SB bit has
no effect during external memory writes. During write operations, the data object is
replicated in all positions of the data bus. For example, a byte write would result in the
stored byte appearing in all four positions of the stored word. It is the responsibility of
external memory to decode the OPT field and lower address bits when determining
which byte position should be written. Note that the microcontroller members of the 29K
family implement the memory glue logic on-chip.
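The extension applied on reads and the replication applied on writes can be modelled as below. The sketch identifies the object size by a byte count rather than by OPT encoding, and the function names are hypothetical.

```python
MASK = 0xFFFFFFFF


def extend_loaded(value, size_bytes, sb):
    """Right-justify a loaded object and sign- or zero-extend it to 32 bits."""
    bits = 8 * size_bytes
    low_mask = (1 << bits) - 1
    value &= low_mask
    if sb and value & (1 << (bits - 1)):
        value |= MASK & ~low_mask        # fill the upper bits with the sign
    return value


def replicate_for_store(value, size_bytes):
    """Data bus contents for a write: the object repeated in every position."""
    if size_bytes == 1:
        return (value & 0xFF) * 0x01010101
    if size_bytes == 2:
        return (value & 0xFFFF) * 0x00010001
    return value & MASK
```

Replication means the processor never has to steer a byte to a particular lane; the memory system selects the lane to write from the OPT field and the low address bits.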

Instruction field-B (SRCB) supplies the external memory address for LOAD
and STORE instructions. Typically a CONST, or CONST and CONSTH, instruction
sequence precedes the LOAD or STORE instruction and establishes the access
address for memory. However, the first 256 bytes of memory can be accessed with
immediate addressing, where the 8-bit SRCB value contains the address. Some systems
may be able to make use of this feature where performance is critical and the use of
CONST type instructions is to be avoided.

As described in Chapter 1, the use of LOAD and STORE instructions can affect
the processor pipeline utilization. Members of the 29K family which support a
Harvard memory architecture, or contain on-chip instruction memory cache, can
perform LOAD and STORE operations in parallel with other instructions. This prevents
pipeline stalling, as the instruction execution sequence can continue in parallel with
the external memory access. However, if the instruction following a LOAD operates
on the accessed data then pipeline stalling will still occur. For this reason LOAD
instructions should be positioned early in the instruction sequence, enabling the data
memory access latency to be hidden. Pipeline stalling will also occur if LOAD and
STORE type instructions are placed back-to-back, as this can result in channel access
conflicts. For this reason, LOAD and STORE instructions should be separated with
other instructions as much as possible.

3.1.6  Constant 

The Constant instructions (Table 3-9) provide the ability to place half-word and
word constants into registers. Most instructions in the instruction set allow an 8-bit
constant as an operand. The Constant instructions allow the construction of larger
constants. The Am29050 processor has an additional instruction, CONSTHZ, not
supported in other 29K family members. It places a 16-bit constant in the upper
half-word position while the lower 16 bits are zero filled.


Table 3-9. Constant Instructions

Mnemonic    Operation Description

CONST       DEST <- 0I16 (16-bit data formed with SRCA and SRCB fields)
CONSTH      Replace high-order half-word of SRCA by I16
CONSTN      DEST <- 1I16
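The way a full 32-bit constant is assembled from two instructions can be sketched as follows. The model reads 0I16 as the 16-bit immediate extended with zeros and 1I16 as the immediate extended with ones; the Python function names mirror the mnemonics, and load_constant is a hypothetical helper.

```python
MASK = 0xFFFFFFFF


def const(i16):
    """CONST: the 16-bit immediate, zero filled in the upper half-word."""
    return i16 & 0xFFFF


def consth(register, i16):
    """CONSTH: replace the high-order half-word, keep the low-order one."""
    return ((i16 & 0xFFFF) << 16) | (register & 0xFFFF)


def constn(i16):
    """CONSTN: the 16-bit immediate, ones filled in the upper half-word."""
    return 0xFFFF0000 | (i16 & 0xFFFF)


def load_constant(value):
    """Build any 32-bit value with a CONST followed by a CONSTH."""
    register = const(value & 0xFFFF)
    return consth(register, (value >> 16) & 0xFFFF)
```

CONSTN allows small negative values to be formed with one instruction; for example an immediate of 0xFFFE yields the 32-bit representation of -2.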
3.1.7  Floating-point

The Floating-Point instructions (Tables 3-10 and 3-11) provide operations on
single-precision (32-bit) or double-precision (64-bit) floating-point data. In
addition, they provide conversions between single-precision, double-precision, and
integer number representations. In most 29K family members, these instructions cause
traps to routines which perform the floating-point operations in software. The
Am29050 processor supports all floating-point instructions directly in hardware. It
also has four additional instructions not shown in Tables 3-10 and 3-11. They are
FMAC, DMAC, FMSM, and DMSM, and they are used to perform single- and
double-precision multiply-and-accumulate type operations (see section 3.3.5).

3.1.8  Branch 

The Branch instructions (Table 3-12) control the execution flow of instructions.
Branch target addresses may be absolute, relative to the Program Counter (with the
offset given by a signed instruction constant), or contained in a general-purpose
register (indirect addressing). For conditional jumps, the outcome of the jump is based
on a Boolean value in a general-purpose register. Only the most significant bit in the
specified condition register is tested; Boolean TRUE is defined as bit 31 being set.
Procedure calls are unconditional, and save the return address in a general-purpose
register. All branches have a delayed effect; the instruction following the branch is
executed regardless of the outcome of the branch.

The instruction following the branch instruction is referred to as the delay slot
instruction. Assembly level programmers may have some difficulty in always
finding a useful instruction to put in the delay slot. It is best to find an operation required
regardless of the outcome of the branch. As a last resort a NOP instruction can be
used, but this makes no effective use of the processor pipeline. When programming
in a high level language the compiler is responsible for making effective use of delay
slots. Programmers not familiar with delayed branching often forget the delay slot is
always executed, with unfortunate consequences. For this reason, the example code
throughout this book shows delay slot instructions indented one space compared to
other instructions. This has proven to be a useful reminder.

The  delay  slots  of  unconditional  branches  are  easier  to  fill  than  conditional 
branches.  The  instruction  at  the  target  of  the  branch  can  be  moved  to,  or  duplicated  at, 
the  delay  slot;  and  the  jump  address  can  be  changed  to  the  instruction  following  the 
original  target  instruction. 

The JMPFDEC instruction is very useful for implementing control loops based
on a decrementing counter. The counter register (SRCA) is first tested to determine if the
value is FALSE, then it is decremented. The jump is then taken if a FALSE value was
detected. The code example below shows how count words of external memory can
be written with zero. Note how the address pointer is incremented in the delay slot of
the jump instruction. Additionally, the SRCA register must be initialized to count-2;
Table 3-10. Floating-Point Instructions

Mnemonic    Operation Description

FADD        DEST (single-precision) <- SRCA (single-precision)
            + SRCB (single-precision)
DADD        DEST (double-precision) <- SRCA (double-precision)
            + SRCB (double-precision)
FSUB        DEST (single-precision) <- SRCA (single-precision)
            - SRCB (single-precision)
DSUB        DEST (double-precision) <- SRCA (double-precision)
            - SRCB (double-precision)
FMUL        DEST (single-precision) <- SRCA (single-precision)
            * SRCB (single-precision)
DMUL        DEST (double-precision) <- SRCA (double-precision)
            * SRCB (double-precision)
FDIV        DEST (single-precision) <- SRCA (single-precision)
            / SRCB (single-precision)
DDIV        DEST (double-precision) <- SRCA (double-precision)
            / SRCB (double-precision)
FEQ         IF SRCA (single-precision) = SRCB (single-precision)
            THEN DEST <- TRUE
            ELSE DEST <- FALSE
DEQ         IF SRCA (double-precision) = SRCB (double-precision)
            THEN DEST <- TRUE
            ELSE DEST <- FALSE
FGE         IF SRCA (single-precision) >= SRCB (single-precision)
            THEN DEST <- TRUE
            ELSE DEST <- FALSE
DGE         IF SRCA (double-precision) >= SRCB (double-precision)
            THEN DEST <- TRUE
            ELSE DEST <- FALSE
FGT         IF SRCA (single-precision) > SRCB (single-precision)
            THEN DEST <- TRUE
            ELSE DEST <- FALSE

(continued)


174 


Evaluating  and  Programming  the  29K  RISC  Family 


Table  3-11.  Floating-Point  Instructions  ( Concluded ) 


(continued) 


Mnemonic 

Operation  Description 

DGT 

IF  SRCA  (double-precision)  =  SRCB  (double-precision) 

THEN  DEST  <-TRUE 

ELSE  DEST  <-  FALSE 

SQRT 

DEST  (single-precision,  double-precision,  extended-precision) 
<-SQRT[SRCA  (single-precision,  double-precision,  extended-precision) 

CONVERT 

DEST  (integer, single-precision,  double-precision) 

<-SRCA  (integer,  single-precision,  double-precision) 

CLASS 

DEST  (single-precision,  double-precision,  extended-precision) 
<-CLASS[SRCA  (single-precision,  double-precision,  extended-precision)] 

this  is  because  the  loop  is  taken  when  the  count  value  is  0  and  -1,  because  the  count 
decrement  is  performed  after  the  condition  test.  In  practice,  memory  systems  sup¬ 
porting  burst-mode  accesses  could  alternatively  use  a  STOREM  instruction  to  more 
efficiently  clear  data  memory. 


        const   gr97,count-2        ;establish loop count
        const   gr98,0
        const   gr96,address        ;establish memory address
        consth  gr96,address
clear:  store   0,0,gr98,gr96       ;write zero to memory
        jmpfdec gr97,clear          ;test and decrement count
         add    gr96,gr96,4         ;advance pointer
                                    ;arrive here when loop finished, gr97=-2
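The count-2 initialization can be checked with a small C model of the loop. This is only a sketch: it assumes, following the 29K Boolean convention, that FALSE means the most-significant bit is clear, and the names jmpfdec and clear_words are illustrative rather than taken from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of JMPFDEC: test SRCA first, then decrement it.
 * Returns non-zero when the jump would be taken (SRCA was FALSE, i.e. bit 31 clear). */
static int jmpfdec(int32_t *srca)
{
    int taken = (*srca >= 0);
    *srca -= 1;
    return taken;
}

/* Clear `count` words (count >= 1), mirroring the assembly loop.
 * Returns the final counter value, which stands in for gr97. */
int32_t clear_words(uint32_t *address, int32_t count)
{
    int32_t counter = count - 2;    /* const gr97,count-2 */
    size_t  index   = 0;            /* stands in for the gr96 pointer */
    int     taken;

    do {
        address[index] = 0;         /* store 0,0,gr98,gr96 */
        taken = jmpfdec(&counter);  /* jmpfdec gr97,clear */
        index++;                    /*  add gr96,gr96,4 (delay slot, always runs) */
    } while (taken);

    return counter;
}
```

With count equal to 4 the body runs with the counter at 2, 1, 0 and -1, which gives four stores and leaves the counter at -2, matching the final comment in the listing.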


3.1.9  Miscellaneous  Instructions 

The  Miscellaneous  instructions  (Table  3-13)  perform  various  operations  which 
cannot  be  grouped  into  other  instruction  classes.  In  certain  cases,  these  are  control 
functions  available  only  to  Supervisor-mode  programs. 

The Count Leading Zeros instruction can be very useful to assembly level programmers.
It determines the position of the most-significant one bit in the SRCB operand. If all
bits are zero, then 32 is returned. The instruction is useful when determining priorities
for, say, queues of interrupt requests, where each interrupt may set a bit in the register
operated on. The highest priority interrupt in the queue can be quickly determined by the
CLZ instruction.
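The priority-queue use of CLZ can be sketched in C. The function clz32 is a portable model of the instruction's result as described above; highest_priority and its convention that bit 31 is the most urgent request are assumptions made for illustration only.

```c
#include <stdint.h>

/* Portable model of CLZ: number of zero bits above the most-significant one bit.
 * Returns 32 when the word is zero, as the text describes. */
unsigned clz32(uint32_t word)
{
    unsigned zeros = 0;

    if (word == 0)
        return 32;
    while ((word & 0x80000000u) == 0) {
        word <<= 1;
        zeros++;
    }
    return zeros;
}

/* Hypothetical scheme: each pending interrupt sets one bit, bit 31 being the
 * highest priority. Returns the priority index (0 = highest), or -1 if idle. */
int highest_priority(uint32_t pending)
{
    unsigned zeros = clz32(pending);
    return (zeros == 32) ? -1 : (int)zeros;
}
```

On the processor the loop collapses to a single CLZ instruction; the C loop is only there to make the result easy to check by hand.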
Table 3-12. Branch Instructions

Mnemonic   Operation Description
--------   ---------------------
CALL       DEST <- PC//00 + 8
           PC <- TARGET
           Execute delay instruction
CALLI      DEST <- PC//00 + 8
           PC <- SRCB
           Execute delay instruction
JMP        PC <- TARGET
           Execute delay instruction
JMPI       PC <- SRCB
           Execute delay instruction
JMPT       IF SRCA = TRUE THEN PC <- TARGET
           Execute delay instruction
JMPTI      IF SRCA = TRUE THEN PC <- SRCB
           Execute delay instruction
JMPF       IF SRCA = FALSE THEN PC <- TARGET
           Execute delay instruction
JMPFI      IF SRCA = FALSE THEN PC <- SRCB
           Execute delay instruction
JMPFDEC    IF SRCA = FALSE THEN
               SRCA <- SRCA - 1
               PC <- TARGET
           ELSE
               SRCA <- SRCA - 1
           Execute delay instruction

3.1.10  Reserved  Instructions 

The  remaining  operation  codes  are  reserved  for  instruction  emulation.  These 
instructions  cause  traps,  much  like  the  unimplemented  floating-point  instructions, 
but  currently  have  no  specified  interpretation.  The  relevant  operation  codes,  and  the 
corresponding  trap  vectors  are  given  in  the  processor  User’s  Manual. 

These  instructions  are  intended  for  future  processor  enhancements,  and  users 
desiring  compatibility  with  future  processor  versions  should  not  use  them  for  any 
purpose. 

The software developer should be aware of the trap taken with the reserved instruction
opcode 0xff. When execution is attempted with this opcode, a trap 63 is taken. This can
occur when a program goes out of control and attempts to fetch instructions from
nonexistent memory.

Table 3-13. Miscellaneous Instructions

Mnemonic       Operation Description
--------       ---------------------
CLZ            Determine number of leading zeros in a word
SETIP          Set IPA, IPB, and IPC with operand register-numbers
EMULATE        Load IPA and IPB with operand register-numbers, and Trap Vector
               Number in field-C
INV [ID]       INV         reset all Valid bits in instruction and data caches
               INV 1;      reset all Valid bits in instruction cache
               INV 2;      reset all Valid bits in data cache
IRET           Perform an interrupt return sequence
IRETINV [ID]   IRETINV     perform an interrupt return and invalidate all caches
               IRETINV 1;  perform an interrupt return and invalidate instruction cache
               IRETINV 2;  perform an interrupt return and invalidate data cache
HALT           Enter Halt mode on next cycle


3.2  CODE  OPTIMIZATION  TECHNIQUES 

When a high level programming language is used for software development, code
optimization is left to the compiler. With assembly language programming, care must be
taken to avoid code sequences which impact upon the processor's performance. Section
3.1.5 described how LOAD and STORE instructions must be carefully positioned if
pipeline stalling is to be avoided. Section 3.1.8 discussed the delay slot of branch
instructions, and the importance of finding a useful instruction for delay slot filling.
This section describes a few more useful coding techniques which can improve code
execution performance.

Common Subexpression Elimination is a technique where a frequently occurring code
sequence is reduced to a single occurrence. This usually requires the result of the code
sequence to be held in register space for frequent and fast access. The trade-off between
expression reevaluation and consuming additional register resources is easily made with
the 29K family because of the large number of general purpose registers available. Code
subexpressions need not be large. They may be as short as an address calculation using a
CONST-CONSTH instruction pair. The calculation can be done once and the address
kept in a register for reuse.

Moving code out of loops is another technique frequently used to improve performance.
However, the typically small number of registers in a CISC processor can often mean
loop invariant code results have to be held in external memory. This can lead to
trade-offs between adding code within a loop or suffering the external memory access
penalties. Again, the large number of general purpose registers in the 29K assists the
programmer in achieving improved code.
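The idea of moving invariant code out of a loop can be illustrated in C, although an optimizing compiler would normally perform this transformation itself; the point here is the shape an assembly programmer would aim for by hand. The function names and the expression being hoisted are invented for the example.

```c
#include <stddef.h>

/* Before: the scale factor is re-evaluated on every pass through the loop. */
void scale_naive(int *data, size_t n, int gain, int offset)
{
    for (size_t i = 0; i < n; i++)
        data[i] = data[i] * (gain * 4 + offset);
}

/* After: the loop-invariant subexpression is computed once and kept in a local,
 * which on the 29K would simply occupy one of the many general purpose registers. */
void scale_hoisted(int *data, size_t n, int gain, int offset)
{
    const int factor = gain * 4 + offset;   /* evaluated once, outside the loop */

    for (size_t i = 0; i < n; i++)
        data[i] = data[i] * factor;
}
```

Both functions produce the same results; only the amount of work repeated inside the loop differs.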

Branch instructions should be avoided where possible, as their use impacts badly on
performance. Processors supporting burst-mode addressing operate most efficiently when
instruction bursting is not disrupted by a branch instruction. This is particularly true
for memory systems which have a high first-access latency. The Branch Target Cache
incorporated in some 29K family members can help hide the effects of branching, but as
the number of branch instructions is increased the chance of a hit occurring in the cache
is reduced.

Loop Inversion is a useful technique for reducing the use of branch instructions.
Often programmers will construct loops which have the loop condition test at the top
of the loop. This requires a further branch at the bottom of the loop to return to the
test. If the conditional branch is moved to the bottom of the loop then the number of
branch instructions is reduced.
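Loop inversion can be sketched at the C level. The two functions below are illustrative (their names are not from the text) and compute the same sum; the second places the conditional test at the bottom of the loop, guarded by a single test before entry.

```c
#include <stddef.h>

/* Test at the top: compiled naively, each pass needs the conditional exit branch
 * plus an unconditional branch at the bottom back to the test. */
long sum_top_test(const int *v, size_t n)
{
    long   total = 0;
    size_t i     = 0;

    while (i < n) {
        total += v[i];
        i++;
    }
    return total;
}

/* Inverted: one guard test before the loop, then a single conditional branch
 * at the bottom of each pass. */
long sum_inverted(const int *v, size_t n)
{
    long   total = 0;
    size_t i     = 0;

    if (i < n) {
        do {
            total += v[i];
            i++;
        } while (i < n);
    }
    return total;
}
```

The guard preserves the behavior for an empty range, which a bare do-while would otherwise break.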

3.3  AVAILABLE  REGISTERS 

In essence, global registers gr64-gr95 are reserved for interrupt handlers and operating
system use. The remaining 32 global registers (gr96-gr127) are reserved for holding User
mode program context. The high level language calling convention described in Chapter 2
established this convention. Figure 3-3 illustrates the partitioning of the global
registers among the operating system and user program tasks. General purpose registers
128-255 are better known as local registers, and are accessed via the register stack
pointer, gr1, rather than directly addressed as global registers. General purpose
registers 128-255 cannot be accessed like global registers (gr96-gr127); they can only be
accessed as local registers or via indirect pointers.

Figure 3-3. General Purpose Register Usage

The calling convention goes further than just dividing the register space into two
groups. The user space registers are assigned particular high level language support
tasks. All but four registers (gr112-gr115) in user task space will be accessed and
modified by compiler generated code at various times. Most of the registers are used as
compiler temporaries; three registers are used to support memory and register stacks;
and the remaining four registers support the high level language procedure call
mechanism and system calls. Global registers in the range gr121-gr127 are known to the
programmer by special synonyms; however the registers themselves operate no differently
from other global registers.

In particular gr121 (tav) and gr122 (tpc) are used to pass arguments to trap routines
invoked with Assert type instructions. This occurs during procedure prologue and
epilogue as well as operating system service calls. At other times, the compiler uses
these registers to hold temporary data values.

Register gr123 (lrp) is known as the Large Return Pointer. It is used when a procedure
is required to return an object which is larger than 16 words and therefore cannot fit in
the normal return space (gr96-gr111). The caller must assign lrp to point to memory
which will hold the 17th and higher words of return data.

Register gr124 (slp) is known as the Static Link Pointer. It is used when accessing data
variables defined in a parent procedure. This occurs in some languages, Pascal for
example, where nested procedure declarations are permitted. The High C 29K and GNU
compilers do not use this register unless C language extensions are used.

A called procedure can locate its dynamic parent and the variables of the dynamic
parent because of the caller-callee activation record linkage (see section 3.5). However,
the linkage is not adequate to locate variables of the static parent which may be
referenced in the procedure. If such references appear in a procedure, the procedure must
be provided with a local register which forms part of the static link pointer chain.
Since there can be a hierarchy of static parents, the slp points to the slp of the
immediate parent, which in turn points to the slp of its immediate parent.

The memory-stack support register gr125 (msp) and the register-stack support registers
gr126 (rab), gr127 (rfb) and gr1 (rsp) were discussed in detail in Chapter 2. They
maintain the current positions of the stack resources.

The calling convention does not assign any particular task to the registers in the
operating system (OS) group (gr64-gr95). However, over time a convention has evolved
among 29K processor users. The subdivision of the OS registers shown in Figure 3-3 is
widely adhered to. The subgroups are known as: the interrupt freeze mode temporaries
(given synonyms it0-it3); the operating system temporaries (kt0-kt11); and the operating
system static support registers (ks0-ks15). Note, static register ks0 is often combined
with it0-kt11 to form an interrupt context cache (see section 2.5.1). Consequently, ks1
is the first available static support register.

When developing a new assembly language procedure a useful technique is to construct
a C language routine which receives any passed parameters and implements the
appropriate task. With the AMD High C 29K compiler, the procedure can be compiled to
produce an assembly language output file with the "-S -Hanno" compiler switches. The
assembly level code can then be directly modified into the required code sequence.

3.3.1  Useful  Macro-Instructions 

The code examples shown in later chapters make use of a number of macros for pushing
and popping special registers to an external memory stack. A macro instruction is
composed of a sequence of simpler instructions. Effectively, a macro is an in-line
procedure call. Using macros is faster than making an actual procedure call but
consumes more instruction memory space. The macro definitions are presented below:


        .macro  pushsr,sp,reg,sreg      ;macro name and parameters
        mfsr    reg,sreg                ;copy from special
        sub     sp,sp,4                 ;decrement pointer
        store   0,0,reg,sp              ;store on stack
        .endm

        .macro  popsr,sreg,reg,sp       ;macro name and parameters
        load    0,0,reg,sp              ;get from stack
        add     sp,sp,4                 ;increment pointer
        mtsr    sreg,reg                ;move to special
        .endm

Note how the LOAD instruction is used first when popping. This enables the ADD and
MTSR instructions to overlap the LOAD execution and thus reduce pipeline stalling.
This is particularly useful when popsr macro instructions are used back-to-back in
sequence. Such sequences are useful when a memory system does not support burst-mode
addressing. If burst mode is supported then it can be more efficient to use a LOADM
instruction and then transfer the global register data into the special registers.
However, LOADM and STOREM cannot be used in Freeze mode code, which frequently
requires popsr and pushsr instruction sequences. Similar macros are used to push and pop
global registers:

        .macro  push,sp,reg             ;macro name and parameters
        sub     sp,sp,4                 ;decrement pointer
        store   0,0,reg,sp              ;store on stack
        .endm

        .macro  pop,reg,sp              ;macro name and parameters
        load    0,0,reg,sp              ;get from stack
        add     sp,sp,4                 ;increment pointer
        .endm

3.3.2 Using Indirect Pointers and gr0

Three of the 29K special registers are known as indirect pointers: IPA, IPB, and IPC.
These registers are used to point into general purpose register space, and support
indirect register access. They hold the absolute register number of the general purpose
register being accessed, and are used in instructions by referencing the pseudo-register
gr0. When an indirect pointer is to be used to identify an instruction operand, gr0 is
placed in the appropriate instruction operand field. Indirect pointer IPA is used with
the SRCA operand field. Similarly, IPB and IPC apply with the SRCB operand and DEST
instruction fields.

The indirect pointer registers are set with the SETIP and EMULATE instructions.
Additionally, they are set when a trap is taken as a result of executing an instruction
which is not directly supported by the 29K processor. With some family members this
occurs with floating-point operations and integer multiply and divide. The example code
below shows how the gr0 register is used to select indirect pointer use. Note, indirect
pointers cannot be accessed in the cycle following the one in which they are set; this
explains the NOP instruction.


        setip   gr98,lr2,gr96           ;set indirect pointers
        nop                             ;delay
        add     gr0,gr97,gr0            ;gr98 = gr97+gr96

The main use of indirect pointers is to support transparent routines (see section 3.7)
and instruction emulation. With most 29K family members the integer multiply
instruction (MULTIPLY) is not directly supported.

        multiply lr4,gr98,lr10          ;integer multiply, lr4 = gr98*lr10

On entering the trapware support routine for the vector assigned to the MULTIPLY
instruction (vector number 32) the indirect pointers are set to IPC=lr4, IPA=gr98 and
IPB=lr10 for the above example. This enables the trap handler to easily and efficiently
access the register operands for the instruction without having to examine the actual
instruction in memory.

When using a MTSRIM instruction to set an indirect pointer register value, it is
important to remember that the most significant bit (bit position 9) must be set if local
registers are to be accessed. This is because indirect pointers operate with absolute
register numbers. See the following section discussing the use of gr1 for more details
on register addressing.
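The bit-9 rule can be made concrete with a short C sketch. It assumes that the 8-bit absolute register number occupies bits 9..2 of an indirect pointer register, which is inferred from the statement that bit position 9 is the most significant bit; the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: the absolute register number (0-255) sits in bits 9..2 of an
 * indirect pointer register. Returns the immediate value a MTSRIM would carry. */
uint16_t ip_value(unsigned abs_regnum)
{
    return (uint16_t)((abs_regnum & 0xFFu) << 2);
}

/* Bit 9 set means the pointer selects an absolute register in the range 128-255,
 * i.e. one of the local registers. */
int ip_selects_local(uint16_t ip)
{
    return (ip & (1u << 9)) != 0;
}
```

Under this assumption gr96 gives the value 0x180 (bit 9 clear), while absolute register 130 gives 0x208 (bit 9 set).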


3.3.3 Using gr1


Global register gr1 performs the special task of supporting indirect access of the 128
local registers. When an instruction operand, say SRCA, has its topmost bit set then
base-plus-offset addressing is used to access the operand. This means only general
purpose registers in the range gr1-gr127 can be addressed via their absolute register
number given the supported instruction operand decoding. (Indirect pointers enable all
general purpose registers to be accessed via absolute register numbers.) The lower 7 bits
of the operand supply the offset, which is shifted left 2 bits and then added to the
local register base held in register gr1. Register gr1 is a 32-bit register, and bits 8-2
contain the local register base (see Figure 3-4).


Figure 3-4. Global Register gr1 Fields

The base-plus-offset calculation is performed modulo-128. The most significant address
bit is assumed set when forming the absolute address for all local register accesses.
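The operand decoding described above can be written out as a small C function. This is a sketch of the addressing rule only: it ignores the special meaning of gr0, and the function name abs_register is invented for the example.

```c
#include <stdint.h>

/* Map an 8-bit instruction operand field to an absolute register number (0-255),
 * given the current contents of gr1. */
unsigned abs_register(uint8_t operand, uint32_t gr1)
{
    if ((operand & 0x80u) == 0)
        return operand;                        /* global: absolute number as-is */

    unsigned base   = (gr1 >> 2) & 0x7Fu;      /* bits 8-2 of gr1 */
    unsigned offset = operand & 0x7Fu;         /* lower 7 bits of the operand */

    return 0x80u | ((base + offset) & 0x7Fu);  /* modulo-128, top bit forced set */
}
```

For example, operand 0x82 (lr2) with a base of 127 wraps around to absolute register 129, showing the modulo-128 behavior.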

29K processors use a shadow copy of the gr1 register when performing local register
addressing. The shadow copy can only be modified by an arithmetic or logical
instruction; a shift or load into gr1 will not update the shadow copy. Because of the
shadow register technique, there must be a delay of one cycle before the register file
can be accessed after gr1 has been modified.

3.3.4  Accessing  Special  Register  Space 

The special registers control the operation of the processor. They are divided into two
groups: those that can be accessed only in Supervisor mode and those which have
unrestricted access. Accesses to special registers sr128 and above do not generate a
protection violation in User mode. Special register space was described in section
1.10.2. Not all 29K family members have fully implemented special register spaces. In
the Supervisor-only accessible space there are a number of differences due to
differences in on-chip resources such as cache memory and hardware breakpoint registers.
Because these are not accessible to application code they do not affect application code
portability.

However, some members of the 29K family do not implement, in hardware, all of the
special registers accessible by User mode programs. In particular the floating-point
support registers (sr160-sr162) are only implemented on processors which directly
support floating-point instructions in hardware. All other family members virtualize
these registers. An attempted access to an unimplemented special register causes a
Protection Violation trap to occur. The trapware code implements the access and returns
the result. Unfortunately, the trapware code cannot use the indirect pointers as they
are not set by a protection violation trap. This means the trapware must read the
instruction space to determine the special register being accessed. Consequently, the
special floating-point support registers cannot be virtualized with Harvard memory
architectures which do not provide a memory bridge to enable instruction memory to be
accessed as data. The emulation technique also requires the support of three operating
system registers. The trapware is typically configured to use global registers ks13-ks15
(gr93-gr95) for this task.

Special registers are located in their own register space. They can only be accessed by
the move-from (MFSR) and move-to (MTSR) instructions which transfer data between
special register space and general purpose registers. In addition there is a MTSRIM
instruction which can be used to set a special register with 16-bit immediate data. The
indirect pointers cannot be used to access special register space. This imposes some
restriction in accessing special registers but in practice is acceptable. However, where
the address of a special register to be accessed is contained in a general purpose
register, the technique shown below can be used. In the example, lr2 contains the address
of the special register to be read with a MFSR instruction. The example assumes
instruction memory can be written to; the required instruction is built in lr2 from a
template held in gr97 and stored in memory at an address given by gr98. The instruction
is then visited with a JMPI instruction. A jump instruction's target address is visited
when the jump instruction contains a further jump in its delay slot. The second jump is
in the decode stage of the processor pipeline when the first jump is in execute. This
means the second jump must be taken, and only the first instruction of the new
instruction stream is started before execution continues at label continue.


        const   gr98,I_memory           ;establish instruction address
        consth  gr98,I_memory
        const   gr97,0xC6600000         ;MFSR, DEST=gr96, SRCA=0
        consth  gr97,0xC6600000
        sll     lr2,lr2,8               ;lr2 has special register number
        or      lr2,lr2,gr97            ;instruction now constructed
        store   0,0,lr2,gr98            ;store target instruction
        jmpi    gr98                    ;visit the target instruction
         jmp    continue                ;must execute the delay slot
continue:


The constructed MFSR instruction places the result in register gr96. The lr2 source
address value had to be shifted left 8 bits into the SRCA field position of the MFSR
instruction.
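The instruction word built by the SLL and OR pair can be reproduced in C for checking. The template value comes from the listing; the function name and the 8-bit mask on the register number are additions for illustration (the assembly sequence does not mask).

```c
#include <stdint.h>

#define MFSR_TEMPLATE 0xC6600000u   /* MFSR, DEST=gr96, SRCA=0 (from the listing) */

/* Build the MFSR instruction word that reads special register `special_regnum`
 * into gr96. The mask is an added safeguard and is not part of the original code. */
uint32_t build_mfsr_gr96(uint32_t special_regnum)
{
    return MFSR_TEMPLATE | ((special_regnum & 0xFFu) << 8);
}
```

For special register 128 this yields 0xC6608000, the word the assembly sequence would store at the address held in gr98.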

3.3.5  Floating-point  Accumulators 

The Am29050 processor is currently the only member of the 29K family which directly
supports floating-point arithmetic operations in hardware. In addition to supporting
floating-point operations without using trapware emulation, functions involving
multiply-and-accumulate operations are supported by four additional hardware
instructions not implemented in other 29K family members. Sum-of-product type
operations are frequently required by many floating-point intensive applications, such
as matrix multiplication. Implementing this operation efficiently in hardware makes the
Am29050 processor suitable for use in graphics and signal processing applications.

The FMAC and DMAC instructions can be used to multiply two general purpose register
values together and sum the product with one of the four floating-point accumulators.
The DMAC instruction operates on double-precision operand data and the FMAC operates
on single-precision. Double-precision operands can be accessed from the register file in
a single cycle as the register file is implemented as 64 bits wide, and there are 64-bit
wide ports supplying data to the floating-point execution unit components.
Double-precision operands must be aligned on double-register address boundaries.

The FMSM and DMSM instructions support single and double-precision floating-point
multiply-and-sum. One operand for the multiplication is a general purpose register, the
second is accumulator 0; the product is summed with the second instruction operand and
the result placed back in the register file. These two instructions can be used when the
multiplier is a fixed value such as with SAXPY (single-precision A times X plus Y).
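For readers unfamiliar with SAXPY, a plain C rendering is given here. It only shows the arithmetic that one FMSM per element would perform, with the fixed multiplier a playing the part of accumulator 0; it makes no claim about how a 29K compiler schedules the loop.

```c
#include <stddef.h>

/* SAXPY: y[i] = a * x[i] + y[i] for each element.
 * `a` is the fixed multiplier, corresponding to the value held in accumulator 0. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* one multiply-and-sum per element */
}
```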

The Floating-Point Unit on the Am29050 processor is constructed from a number of
specialized operation pipelines: one for addition/subtraction, one for multiplication,
and one for division/square root. The functional units used by the pipelines all operate
separately. This enables multiple floating-point instructions to be in execute at the
same time. Additionally floating-point operations can commence in parallel with
operations carried out by the processor's integer pipeline. The operation of some of the
pipeline functional units can be multicycle and contention for resources can result if
simultaneous floating-point operations are being performed. However, all floating-point
operations are fully interlocked, and operations requiring the result of a previous
functional unit operation are prevented from proceeding until that result is available.
The programmer never has to become involved in the pipeline stage details to ensure the
success of an operation.

To sustain efficient use of the floating-point pipelines, four floating-point
accumulator registers are provided. The programmer must multiplex their use during
heavily pipelined code sequences to reduce resource contention. The Am29050 processor
can issue a new floating-point instruction every cycle but many of the operations have
multicycle latency. Thus to avoid pipeline stalling, the results should not be used until
a sufficient number of delay cycles has passed (see Am29050 processor User's Manual).
The processor has an additional 64-bit write port on the general purpose register file
for use by the floating-point unit. This enables floating-point results to be written
back at the same time as integer pipeline results.
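Multiplexing the accumulators can be pictured with a dot product that keeps four independent partial sums, so that no multiply-accumulate depends on the result of the one issued just before it. The C below is an illustrative sketch (the name dot4 is invented), not generated Am29050 code, and actual latencies should be taken from the User's Manual.

```c
#include <stddef.h>

/* Dot product using four independent partial sums, standing in for the four
 * floating-point accumulators. */
double dot4(const double *a, const double *b, size_t n)
{
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc[0] += a[i]     * b[i];
        acc[1] += a[i + 1] * b[i + 1];
        acc[2] += a[i + 2] * b[i + 2];
        acc[3] += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)              /* leftover elements */
        acc[0] += a[i] * b[i];

    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

Because the partial sums are combined in a different order from a simple left-to-right loop, results may differ in the last bits for data that is not exactly representable.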

The  floating-point  accumulators  can  be  accessed  by  the  MTACC  (move-to) 
and  MFACC  (move-from)  instructions  which  are  available  to  User  mode  code.  Only 
29K  family  members  which  directly  support  floating-point  operations  implement 
these  instructions. 

3.4  DELAYED  EFFECTS  OF  INSTRUCTIONS 

Modification of some registers has a delayed effect on processor behavior. When
developing assembly code, care must be taken to prevent unexpected behavior. The easiest
of the delayed effects to remember is the one cycle that must separate setting an
indirect pointer from using it. This occurs most often with the register stack pointer.
It cannot be used to access a local register in the instruction that follows the
instruction that writes to gr1. An instruction that does not require gr1 (and that means
all local registers referenced via gr1) can be placed immediately after the instruction
that updates gr1.

Direct modification of the CPS register must also be done carefully, particularly where
the freeze (FZ) bit is cleared. When the processor is frozen, the special-purpose
registers are not updated during instruction execution. This means that the PC1 register
does not reflect the actual program counter value at the current execution address, but
rather at the point where freeze mode was entered. When the processor is unfrozen,
either by an interrupt return or direct modification of the CPS, two cycles are required
before the PC1 buffer register reflects the new execution address. Unless the CPS
register is being modified directly, this creates no problem.

Consider the following examples. If the FZ bit is cleared and trace enable (TE) is set at
the same time, the next instruction should cause a trace trap, but the PC buffer
registers frozen by the trap will not have had time to catch up with the current
execution address. Within the trap code the processor will appear to have stopped at
some random address, held in PC1. If interrupts and traps are enabled at the same time
the FZ bit is cleared, then the next instruction could suffer an external interrupt or an
illegal instruction trap. Once again, the PC buffer registers will not reflect the true
execution address. An interrupt return would cause execution to commence at a random
address. The above problems can be avoided by clearing FZ two cycles before enabling the
processor to once again enter freeze mode.


3.5  TRACE-BACK  TAGS 

When the compiler generates the code for a procedure, it places a one- or two-word
tag before the first instruction of the procedure. The tag information is used by
debuggers to determine the sequence of procedure calls and the value of program
variables at a given point in program execution. The trace-back tag describes the
memory frame size and the number of local registers used by the associated
procedure. A one-word tag is used unless the memory stack usage is greater than 2k
words, in which case a two-word tag is used. Figure 3-5 shows the format of the tag data.

Most of the tag data fields are self-explanatory. The M bit-field is set if the
procedure uses the memory stack. In that case, msize is the memory stack frame size
in double words. The argcount is the number of incoming arguments in registers
plus two. The T bit-field, when set, indicates the routine is transparent (see section
3.7).

When procedures are built in assembly language rather than in, say, C, the
programmer is responsible for building the appropriate tag data word(s) ahead of the
first instruction. For an example see section 2.3.5. Figure 3-6 shows an example register
stack history for three levels of procedure calls. In the example, the current procedure
is a small leaf procedure. Small leaves differ from large leaf procedures in that they
do not lower the register stack and allocate new local registers to the procedure.

Looking at the parent of the current procedure, it can be seen that the stack was
lowered by six words (rsize) during the parent procedure prologue. The top of the
activation record is identified by the procedure lr1 register value. In principle, the start
of the grandparent procedure activation record can be found by subtracting the
argcount value from the address identified by the parent lr1. In this way the rsize for
the parent procedure can be determined; adding rsize to the parent's gr1 value enables
the grandparent gr1 value to be obtained. Repeating the mechanism with the
grandparent lr1 value allows all previous activation records to be identified until the
first procedure call is found. The first procedure is identified by a tag value of zero,
and is normally the start procedure in file crt0.s.

[Figure: one-word tag layout with fields including M, argcount, reserved and msize;
bit positions 31, 23, 15, 7 and 0 are marked]

Figure 3-5. Trace-Back Tag Format


However, there is a problem with this scheme, as shown in Figure 3-6. Small leaf
procedures do not have lr1 values for their own activation record; they share the lr1
value of their parent. Additionally, large leaf procedures have a new lr1 register
assigned, but because leaves do not call other procedures, the lr1 register is not
assigned to point to the top of the activation record. For this reason, the lr1 value
cannot be used initially as a mechanism for walking back through procedure call
register allocation.

In practice, most debuggers walk back through instruction memory until they find
the current procedure tag value, then they look at the immediately following
prologue code. The first prologue instruction is normally a “SUB gr1, gr1, rsize*4”. If
the rsize is bigger than 64, then it is a CONST followed by a SUB. In any case the rsize
value is determined by this method rather than by computing it from an lr1-argcount
based calculation.
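
The prologue inspection described above can be sketched in C. This is an
illustration only, not the code of any particular debugger. It assumes, beyond what
the text states, that an instruction word carries its opcode in bits 31-24 followed by
three 8-bit operand fields, that SUB with an 8-bit constant has opcode 0x25, and that
a tag word can be recognized by a zero top byte. The function names are hypothetical,
and the encoding should be checked against the processor manual before use.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTIONS (not taken from the text): opcode in bits 31-24, three 8-bit
 * operand fields below it, SUB-with-constant opcode 0x25, and a tag word
 * recognisable by a zero top byte. */
#define OP_SUB_CONST 0x25u   /* assumed opcode: SUB rc, ra, constant */
#define REG_GR1      0x01u   /* gr1, the register stack pointer      */

/* Walk back from word index pc to the nearest tag word.
 * Returns its index, or -1 if no tag is found. */
long find_tag(const uint32_t *imem, size_t pc)
{
    for (size_t i = pc + 1; i-- > 0; ) {
        if ((imem[i] >> 24) == 0)
            return (long)i;
    }
    return -1;
}

/* Decode "SUB gr1, gr1, rsize*4" in the word after the tag.
 * Returns rsize in words, or -1 if the prologue has another shape
 * (for example a CONST + SUB pair, or a small leaf with no SUB). */
int prologue_rsize(const uint32_t *imem, size_t nwords, size_t pc)
{
    if (pc >= nwords)
        return -1;
    long tag = find_tag(imem, pc);
    if (tag < 0 || (size_t)tag + 1 >= nwords)
        return -1;

    uint32_t insn = imem[tag + 1];
    uint32_t op  = insn >> 24;
    uint32_t rc  = (insn >> 16) & 0xFFu;
    uint32_t ra  = (insn >> 8) & 0xFFu;
    uint32_t imm = insn & 0xFFu;

    if (op != OP_SUB_CONST || rc != REG_GR1 || ra != REG_GR1)
        return -1;
    return (int)(imm / 4);   /* the constant is rsize*4 bytes */
}
```

An 8-bit constant field, if the assumption holds, would also explain why larger
frames need a CONST followed by a SUB: rsize*4 no longer fits in the instruction.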

Before the Am29050 processor became available, floating-point intensive
applications were normally supported with an Am29027 coprocessor. The procedure
call mechanism specified that coprocessor float registers 0-2 are allowed to be
modified by the callee and are not saved across calls. Float registers 3-7 may also be
modified by the callee, so to preserve them across a procedure call the caller must
save them before making the call and restore them upon callee return. A region of the
procedure activation record is assigned for saving the coprocessor registers.
Additionally, the fp field in the tag word is used to indicate the number of registers
saved.

Figure 3-6. Walking Back Through Activation Records


When using an Am29050 processor the fp field value is always zero. The four
double-word accumulator registers are not preserved across a procedure call. If a
procedure uses the accumulators and wishes to preserve their contents, it must first
save them before making a procedure call. This may involve temporarily modifying
the special floating-point environment registers. Because the floating-point
accumulators are normally accessed by assembly language leaf routines, caller saving of
the accumulators results in a reduced overhead compared to callee saving.

3.6  INTERRUPT  TAGS 

The High C 29K compiler will place an additional tag word before the normal
procedure tag when the key word _Interrupt is used to define a procedure's return
type. Figure 3-7 shows the typical tag word combination produced. The first
procedure of an interrupt handler, that is, the procedure accessed after the interrupt
vector is processed and any necessary preparation work is performed, should be
identified by the _Interrupt key word. Examples of how interrupt tags are used by
application code are shown in section 2.5.4.

The second (or second and third) word of tag information has the same format as
all procedure tags. Only the first tag word is new, and this word is known as the
interrupt tag word. It has several bit-fields which describe the execution environment
required by the procedure. These fields can be examined at interrupt occurrence time or
at interrupt installation time to determine the appropriate steps required to prepare
the interrupt processing environment. The objective is to optimize interrupt
preparation by preserving only the minimum required context. Of course, the exact steps
taken will be very much dependent on the operating system in use. For example, some
operating systems may process interrupts in User mode with address translation in
use. Others may process interrupts in Supervisor mode with physical addressing.

[Figure: an interrupt tag word followed by the normal one-word procedure tag]

Figure 3-7. Interrupt Procedure Tag Words

The C bit is set if the procedure calls any other procedure (excluding transparent
routines). That is, the C bit is set if the procedure is not a leaf. When another
procedure is called, it may be necessary to have the register stack repaired before the
first procedure is entered. The local registers bit-field indicates the number of registers
required from the register stack cache. However, other procedures called by the first
procedure may require additional local registers. Note that large leaf routines may
require local registers, but of course the C bit will still not be set. When the C bit is set,
preparation code is unlikely to scan the other bit-fields, as it is usually necessary to
assume that called functions may perform any 29K operation.

The F bit will be set if any floating-point operations are performed. Most 29K
family members do not directly support floating-point instructions but take a trap
when a floating-point instruction is encountered. Trap handlers cannot be entered
from Freeze mode, and execution of a floating-point operation could modify the state
of the floating-point accumulators (Am29050) or coprocessor (Am29027) status
registers.

The I bit is set if any of the indirect pointer registers (IPA, IPB and IPC) are
modified by the procedure. These registers would be affected by a call to a transparent
helper routine which issues a trap. The High C 29K compiler uses a transparent
routine to perform integer multiply with most 29K family members. If the I bit is set then
interrupt preparation code would be required to preserve the indirect pointer registers
before entering the first procedure. The Q bit is set when the Q register (sr131) is
modified. This register is used during floating-point and integer multiply and divide
emulation routines.

The 29K calling convention states that a procedure returns its results in global
registers gr96-gr111. An interrupt handler routine does not have any return value.
However, it may use registers in this range to hold temporary values during
procedure execution. The return registers bit-field indicates the number of registers used.
Additionally, temporary registers can be obtained from register range gr116-gr120.
In fact, the GNU compiler prefers to allocate temporary registers in this range before
allocating from the return registers range (see section 2.5.2). The temps bit-field
indicates the number of registers modified in the range gr116-gr124. When interrupt
processing is accomplished with a leaf routine, these bit-fields enable only the
minimum number of global registers to be saved before the interrupt handler procedure is
entered.
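
How preparation code might act on these bit-fields can be sketched in C. This is a
hedged illustration: the struct layouts, member names and the function
plan_interrupt_entry are hypothetical, the fields are assumed to have been extracted
from the interrupt tag word already (the bit positions of Figure 3-7 are not
reproduced), and a real operating system will have its own policy.

```c
#include <stdbool.h>

/* Registers in the ranges named in the text. */
#define RETURN_REG_COUNT 16u   /* gr96-gr111  */
#define TEMP_REG_COUNT    9u   /* gr116-gr124 */

/* Interrupt tag fields after decoding (hypothetical representation). */
struct interrupt_tag {
    bool c;                /* calls other, non-transparent procedures  */
    bool f;                /* performs floating-point operations       */
    bool i;                /* modifies IPA, IPB or IPC                 */
    bool q;                /* modifies the Q register                  */
    unsigned local_regs;   /* local registers needed from the cache    */
    unsigned return_regs;  /* registers used in the gr96-gr111 range   */
    unsigned temp_regs;    /* registers used in the gr116-gr124 range  */
};

/* What the preparation code decides to preserve (hypothetical). */
struct save_plan {
    bool save_everything;      /* C set: assume any 29K operation      */
    bool save_indirect;        /* preserve IPA, IPB and IPC            */
    bool save_q;               /* preserve the Q register              */
    bool save_fp_state;        /* floating point is used               */
    unsigned globals_to_save;  /* number of global registers to save   */
};

struct save_plan plan_interrupt_entry(struct interrupt_tag t)
{
    struct save_plan p;

    if (t.c) {
        /* Called procedures may do anything, so the other fields
         * are not consulted and the full context is preserved. */
        p.save_everything = true;
        p.save_indirect = true;
        p.save_q = true;
        p.save_fp_state = true;
        p.globals_to_save = RETURN_REG_COUNT + TEMP_REG_COUNT;
        return p;
    }

    /* Leaf handler: preserve only what the tag says is touched. */
    p.save_everything = false;
    p.save_indirect = t.i;
    p.save_q = t.q;
    p.save_fp_state = t.f;
    p.globals_to_save = t.return_regs + t.temp_regs;
    return p;
}
```

The plan can be computed once at interrupt installation time and stored with the
vector, or recomputed at each interrupt occurrence; the text allows either.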

3.7  TRANSPARENT  ROUTINES 

Transparent routines are used to support small, highly efficient procedure calls.
They are like small leaf procedures in that they do not lower the register stack and
allocate a new activation record. They are unlike leaf procedures in that the only
global registers which the caller does not expect to survive the call are tav (gr121) and
tpc (gr122). They are normally used to implement compiler-specific support routines
such as integer multiply (where the 29K hardware does not directly support this
operation).

Parameters  are  passed  to  transparent  routines  using  tav  and  the  indirect  pointer 
registers.  Return  values  are  via  tpc  and  possibly  the  modified  register  identified  by 
indirect  pointer  ipc.  Leaf  procedures  can  call  transparent  routines  without  changing 
their  status  as  leaf  routines. 

Newer versions of the High C 29K compiler enable the user to select procedures
for implementation as transparent routines. For example, a procedure which would
normally be of return type “int” would be defined (and declared) as type
_Transparent int. The _Transparent key word extends the C language. Of course there are a
number of restrictions which apply to transparent routine construction: they can
only receive two incoming parameters (passed via IPA and IPB); they must be of
type void or return an object of word size or less (return values are passed via IPC);
they must not perform any floating-point (and some integer) operations which
require trapware support; and of course, they must not call any other procedures (even
if they are transparent).
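
A routine that respects these restrictions might be declared as in the sketch below.
The macro HAVE_TRANSPARENT_KEYWORD and the function max_word are hypothetical names
introduced for illustration; the macro is assumed to be defined by the build only when
a compiler supporting _Transparent is in use, so that the same source still compiles
elsewhere as an ordinary function.

```c
/* _Transparent is a High C 29K language extension. The macro
 * HAVE_TRANSPARENT_KEYWORD is hypothetical: define it in the build
 * when the compiler accepts the key word. */
#ifdef HAVE_TRANSPARENT_KEYWORD
#define TRANSPARENT _Transparent
#else
#define TRANSPARENT
#endif

/* Declared and defined with the same type, as the text requires. */
TRANSPARENT int max_word(int a, int b);

/* Two word-sized parameters, a word-sized result, no floating point,
 * no multiply or divide (which may need trapware), and no calls to
 * other procedures. */
TRANSPARENT int max_word(int a, int b)
{
    return (a > b) ? a : b;
}
```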

3.8  INITIALIZING  THE  PROCESSOR 

Reset mode is entered when the processor's *RESET pin is activated. This
causes the Current Processor Status (CPS) register to be set to the Reset mode values;
the processor operates in Supervisor mode with all data and instruction addresses
being physical (no address translation); all traps and interrupts are disabled and the
processor Freeze mode bit is set. (See the Initialization section of the processor User's
Manual for the exact CPS register setting.) Individual 29K family members have
additional Reset mode conditions established, such as disabling cache memory
where appropriate.

Instruction execution begins at address 0. For processors supporting both
Instruction memory space and read-only memory (ROM) space, ROM space is used
when fetching instructions from external memory. However, many Am29000
processor systems apply no distinction when decoding instruction and ROM memory
space.

The programmer must establish control of the processor and available
resources. Section 7.4 discusses how this is achieved with the OS-boot operating
system. OS-boot is made available by AMD, and is used to implement a single-task
application environment which supports HIF (see Chapter 2) system call services.
Because OS-boot is so freely available to the 29K community, it is convenient to use
the included processor start-up code sequence for any new designs.

3.9  ASSEMBLER  SYNTAX 

Assembly language, like all languages, has a character set and a set of grammar
rules. Purchasers of the ASM29K™ assembly language tool package from AMD or
another tool company normally obtain a copy of the assembly language syntax
specification. There are a number of assembler tools available and all of them comply (but
not always fully) with the AMD defined syntax for assembly level programming.

Many of the assemblers have options which are unique, but it has been my
experience that assemblers will generally accept code which is produced by any of the
available compilers.

3.9.1  The  AMD  Assembler 

The AMD assembly language tool package, ASM29K, was used to develop all
of the assembly language examples shown in this book. The assembler, linker and
librarian tools included in the package were developed by Microtec Research Inc.
(MRI) for AMD. The tools are available on a number of platforms, the most popular
being SUN and HP workstations and IBM PCs and compatibles. This section does
not cover the details of the AMD assembler (as29) and its options as they are well
documented in the literature supplied with each purchased tool package.

During the introduction of the Am29000 processor, AMD had a second
assembly level tool package developed by Information Processing Techniques Corp. (IPT).
This second tool chain forms the basis of a number of elaborate tool packages made
available by third party tool suppliers. All of the tool suppliers are listed in the AMD
Fusion29K(SM) Catalogue [AMD 1992a] [AMD 1993b]. Both assemblers fully
comply with the AMD assembler syntax for 29K code. However, the librarian tools
supplied with the different tool packages maintain library code in different formats. This
means libraries cannot be shared unless reformatting is applied.

3.9.2  Free  Software  Foundation  (GNU)  Assembler

The Free Software Foundation Inc. is an organization based in Cambridge MA,
USA, which helps develop and distribute software development tools for a range of
processors. Anyone can contribute programs to the foundation, and users of
foundation supplied tools have the freedom to distribute copies of tools freely (or can
charge for this service if they wish). The foundation tools (often known as GNU
tools) include a complete tool chain for software development for the 29K family.
The GNU assembler is known as GAS, and is available in source form from AMD
and from the Cygnus Support company.

GAS is primarily intended to assemble the output from the GNU C language
compiler, GCC (see Chapter 2). It does accept code complying with the AMD
assembly language syntax; however, there are a number of differences. Most notably, it
does not support macro instructions. Developers may wish to use a UNIX utility such
as M4 or CPP to support macros with the GAS tool (section 2.5.2 has an example of
assembler macros using the C preprocessor, CPP).

A number of developers have compiled GAS for use in a cross-development
environment where the target processor is a 29K, but the development platform is a
SUN or HP workstation or an IBM 386-PC. These tools are available among the 29K
GNU community, many of whose members are university engineering departments.
AMD has a University Support Program which helps universities wishing to include
the 29K in educational programs to obtain hardware and software development tools
as well as other class materials. There may be a university near you which will supply
you with a copy of the compiled GNU tools for a small tape handling charge.

If you get a copy of GAS from AMD, Cygnus or another Fusion29K partner,
then it is likely that the documentation supporting the tool was supplied. After
installing the tools on a UNIX machine and updating the MANPATH variable to include
the GNU manual pages, it should be possible to just type “man gas” and obtain a
display of the GAS program options. Alternatively, look in the GAS source directories
for a file called 29k/src/gas/doc/gas.1 or as.1 to obtain the necessary documentation.
Below is an extract from the GAS manual pages which indicates some of the
capabilities of the tool.

gas [-a | -al | -as] [-f] [-I path] [-K] [-L] [-o objfile] [-R] [-v] [-W]
    files...

OPTIONS

-a | -al | -as
        Turn on assembly listing; -al, listing only; -as, symbols; -a, everything.

-f      Fast: skip preprocessing (assume source is compiler output).

-I path
        Add path to search list for .include directives.

-K      Issue warning when difference tables altered for long displacements.

-L      Keep (in symbol table) local symbols starting with L.

-o objfile
        Name the object-file output for GAS.

-R      Fold data sections into text sections.

-v      Announce GAS version.

-W      Suppress warning messages.


Chapter  4


Interrupts  and  Traps


This chapter describes techniques for writing interrupt and trap handlers for
29K processor-based systems. It also describes the interrupt hardware for the 29K
Processor Family, and the software environment within which interrupt handlers
execute.

Handler descriptions are separated into two major sections. The first discusses
Supervisor mode handlers and the second covers User mode handlers. The
descriptions apply equally well to interrupts and traps. For the purposes of this
chapter, User mode handlers refer to interrupt and trap handlers written in a
high-order language. However, it is possible to enter User mode without first
establishing high-order language support. Additionally, for our purposes we shall
call assembly level handlers Supervisor mode handlers.

Although interrupts are largely asynchronous events, traps most often occur
synchronously with instruction execution; however, both share common logic in the
29K Processor Family and are often handled entirely in Supervisor mode, with
interrupts disabled and Freeze mode (described later) in effect. However, interrupt and
trap handlers may execute in one or more of the stages shown in Figure 4-1. Each
stage implies an increased level of complexity, and may execute a return from
interrupt (IRET instruction) if the process is complete. However, in the case where User
mode has been entered, the handler must first reenter Supervisor mode before
executing an IRET instruction.

The first stage is entered when an interrupt occurs. In this stage the processor is
running in Supervisor mode, with Freeze mode enabled and interrupts disabled. In
the second stage Freeze mode is turned off (disabled), but the processor remains in
Supervisor mode with interrupts disabled. The third stage execution takes place with
interrupts enabled, but with the processor still operating in Supervisor mode. In the
fourth stage, execution continues in User mode. Each stage is discussed in the
following sections of this chapter.

Figure 4-1. Interrupt Handler Execution Stages

Before entering into a discussion of Supervisor mode interrupts and traps, it is
necessary to first understand the way interrupts are handled by the 29K family
hardware.

4.1  29K  PROCESSOR  FAMILY  INTERRUPT  SEQUENCE 

When  an  interrupt  or  trap  occurs  and  is  recognized,  the  processor  initiates  the 
following  sequence  of  steps. 

■  Instruction  execution  is  suspended. 

■  Instruction  fetching  is  suspended. 

■  Any  in-progress  load  or  store  operation,  which  was  not  the  cause  of  a  trap,  is 
completed.  In  the  case  of  load-  and  store-multiple,  any  additional  operations  are 
suspended. 

■  The  contents  of  the  Current  Processor  Status  (CPS)  register  are  copied  into  the 
Old  Processor  Status  (OPS)  register. 

■ The CPS register is modified as shown below. The letter “u” means unaffected,
and “r” indicates that this bit depends on the value of the RV bit in the CFG
register, or the R bit in the fetched interrupt vector. Note that only 3-bus 29K
processors have the R bit-field implemented. The letter “f” is only supported by
the Am29040 processor; it is used here to indicate that the value of the PD bit is
unaffected when taking a trap or interrupt if the FPD (Freeze PD) bit is set in the
CFG register. Otherwise, the PD bit is set to 1 (see section 5.14.2).

Figure 4-2. The Format of Special Registers CPS and OPS


■ The setting of the Freeze (FZ) bit freezes the Channel Address (CHA), Channel
Data (CHD), Channel Control (CHC), Program Counters (PC0-PC2), and ALU
Status registers.

■  The  address  of  the  first  instruction  of  the  interrupt  or  trap  handler  is  determined. 
If  the  VF  bit  of  the  Configuration  register  is  1,  the  address  is  obtained  by 
accessing  a  vector  from  data  memory.  The  access  is  performed  by  using  the 
physical  address  obtained  from  the  Vector  Area  Base  Address  register  and  the 
vector  number.  If  the  VF  bit  is  0,  the  instruction  address  is  directly  given  by  the 
Vector  Area  Base  Address  register  and  the  vector  number.  For  all  29K 
processors  other  than  3-bus  processors,  the  VF  bit  is  reserved  and  effectively 
set  to  1. 

With 3-bus processors, if the VF bit is 1, the R bit in the vector fetched above is
copied into the RE bit of the CPS register. If the VF bit is 0, the RV bit of the
Configuration register is copied into the RE bit. This determines whether the
first instruction of the interrupt handler is fetched from instruction-ROM space
or from instruction space.

■  An  instruction  fetch  is  initiated  using  the  instruction  address  determined  above. 
At  this  point,  normal  instruction  execution  resumes. 

No registers (beyond the interrupted program's CPS) are saved when an
interrupt occurs. Any registers whose contents are essential to restarting the interrupted
program must be deliberately saved if they are going to be modified by the interrupt
handler.
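
The handler-address step of the sequence above can be sketched in C. The sizes used
are assumptions that the text does not state: one 32-bit word per entry in the vector
table, and one 256-byte (64-instruction) block of handler code per vector number when
VF is 0. The function name and the array standing in for data memory at the Vector
Area Base Address are hypothetical; consult the processor manual for the actual
layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED size, not given in the text. */
#define HANDLER_BLOCK_BYTES 256u  /* handler block per vector (VF = 0) */

/* vab          : contents of the Vector Area Base Address register
 * vecnum       : vector number of the interrupt or trap
 * vf           : VF bit of the Configuration register
 * vector_table : stand-in for data memory at address vab, one assumed
 *                32-bit word per vector (only used when vf is true)    */
uint32_t handler_address(uint32_t vab, unsigned vecnum, bool vf,
                         const uint32_t *vector_table)
{
    if (vf) {
        /* Vector fetch: the handler address comes from data memory.
         * Instructions are word aligned, so the low two bits cannot be
         * address bits; on 3-bus parts the vector also carries the R
         * bit, whose position is not assumed here. */
        return vector_table[vecnum] & ~(uint32_t)3;
    }
    /* No vector fetch: the address is formed directly from the base
     * register and the vector number. */
    return vab + vecnum * HANDLER_BLOCK_BYTES;
}
```

Under these assumptions the vector-fetch method costs a data memory access on each
interrupt but lets handlers live anywhere, while the direct method avoids the access
at the price of a fixed block of instruction memory per vector.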

4.2  29K  PROCESSOR  FAMILY  INTERRUPT  RETURN 

After the handler has processed the interrupt, and control is given back to the
interrupted task, execution of an IRET or IRETINV instruction is used to cause the
Am29000 processor to initiate the following steps.

■ Any in-progress LOAD or STORE operation is completed. If a load-multiple or
store-multiple sequence has been suspended, the interrupt return is not
completed until that operation is finished.

■  Interrupts  and  traps  are  disabled,  regardless  of  the  settings  of  the  DA,  DI,  and  IM 
fields  of  the  CPS  register. 

■  If  the  interrupt  return  instruction  is  an  IRETINV,  the  Valid  bit  associated  with 
each  entry  in  the  Branch  Target  Cache  memory  is  reset.  In  the  case  of 
the  Am29030  processor,  the  IRETINV  instruction  causes  cache  blocks  to 
become  invalid,  unless  the  blocks  are  locked  and  the  cache  is  enabled. 

■  The  contents  of  the  OPS  register  are  copied  into  the  CPS  register.  This  normally 
resets  the  FZ  bit,  allowing  the  Program  Counters  (PC0-PC2)  and  the  CHA, 
CHD,  CHC,  and  ALU  Status  registers  to  update  normally.  The  Interrupt 
Pending  bit  (IP)  of  the  CPS  register  is  always  updated  by  the  processor.  The 
copy  operation  is  irrelevant  for  this  bit. 

■ The address in Program Counter 1 (PC1) is used to fetch an instruction. The
CPS register conditions the fetch. This step is treated as a branch, in the sense
that the processor searches the Branch Target Cache memory for the target of the
fetch.

■ The fetched instruction above enters the decode stage of the pipeline.

■ The address in PC0 is used to fetch an instruction. The CPS register conditions
the fetch. This step is treated as a branch, in the sense that the processor searches
the Branch Target Cache memory for the target of the fetch.

■  The  first  fetched  instruction  enters  the  execute  stage  of  the  pipeline,  and  the 
second  instruction  fetched  enters  the  decode  stage. 

■ If the Contents Valid (CV) bit of the CHC register is 1, and the Not Needed (NN)
bit and the Multiple Operation (ML) bit are both 0, an external access is restarted.
If the PC1 register points to an interrupted load- or store-multiple instruction,
and the ML bit is 1, then an interrupted load- or store-multiple operation is
restarted. The external memory access is continued based on the contents of the
CHA, CHD, and CHC registers. The interrupt return is not completed until this
operation is finished.

■  Interrupts  and  traps  are  enabled  per  the  appropriate  bits  in  the  CPS  register. 

■  The  processor  resumes  normal  operation. 

It is important to remember that once an interrupt or trap occurs, the processor is
immediately vectored to the appropriate handler, with interrupts disabled, Freeze
mode enabled, and Supervisor mode execution. The next section discusses
Supervisor mode interrupt handlers. The final section describes User mode interrupt
handlers. Both sections include 29K Processor Family assembly language source code
examples.
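
The way the CPS and OPS registers cooperate across an interrupt and its return can
be modeled in a few lines of C. This is a simplified sketch: only the three
conditions named in the paragraph above are tracked, the single ints_off flag stands
in for the DA, DI and IM fields collectively, and all type and function names are
hypothetical.

```c
#include <stdbool.h>

/* A reduced view of a processor status register (hypothetical). */
struct status {
    bool fz;        /* Freeze mode                        */
    bool sm;        /* Supervisor mode                    */
    bool ints_off;  /* interrupts disabled (simplified)   */
};

struct cpu {
    struct status cps;  /* Current Processor Status */
    struct status ops;  /* Old Processor Status     */
};

/* Interrupt or trap taken: CPS is copied to OPS, then CPS is forced
 * to the handler entry conditions. Nothing else is saved. */
void take_interrupt(struct cpu *c)
{
    c->ops = c->cps;
    c->cps.fz = true;
    c->cps.sm = true;
    c->cps.ints_off = true;
}

/* IRET: OPS is copied back into CPS, which normally clears FZ and
 * restores the interrupted program's mode and interrupt enables. */
void interrupt_return(struct cpu *c)
{
    c->cps = c->ops;
}
```

Because there is only one OPS register, a handler that re-enables interrupts must
first save OPS (and the frozen registers) itself, or a nested interrupt would
overwrite the state needed to return to the original program.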

4.3  SUPERVISOR  MODE  HANDLERS 

4.3.1  The  Interrupt  Environment 

After an interrupt or trap occurs, and the event is recognized by the processor,
the 29K family hardware interrupt sequence, described earlier, is initiated. Interrupt
handler code begins execution at this point.

The amount of code necessary to handle an interrupt or trap depends on the
nature of the interruption, and the degree to which a given operating system supports
interrupts and traps. For robust systems, interrupt and trap handlers must be sure to
return to an environment guaranteed to be intact when their processing is complete.
Some systems may elect to terminate a program if certain interrupts and traps occur,
while others may ignore these entirely. The operating system will also set some
standards for register availability in interrupt routines. As stated in the section describing
the calling convention (Chapter 2), AMD recommends that the 29K processor's
global registers gr95 and below be reserved for non-User-mode code. Additionally,
section 3.3, in the Assembly Language Programming chapter, goes further and suggests
an allocation scheme for the registers reserved for the operating system (see Table 4-1).


Table 4-1. Global Register Allocations

Registers   Name        Description

gr1         rsp         Local register stack pointer
gr64-67     it0-it3     Interrupt handler temporaries
gr68-79     kt0-kt11    Temporaries for use by operating system
gr80-92     ks0-ks12    Operating system statics
gr93-95     ks13-ks15   Floating-point trap handler statics
gr96-127    various     Reserved by Am29000 processor calling conventions


In essence, global registers (gr64-gr95) are reserved for interrupt handlers and
the operating system. The remaining 32 global registers (gr96-gr127) are reserved
for holding the interrupted program's context.
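
The allocation in Table 4-1 can be captured as a small lookup, which is handy when
checking handler code against the convention. The enum and function names below are
hypothetical; the register ranges are those of the table.

```c
#include <stdbool.h>

/* Allocation classes from Table 4-1 (names are illustrative). */
enum gr_class {
    GR_STACK_POINTER,       /* gr1      rsp                          */
    GR_INT_TEMP,            /* gr64-67  it0-it3                      */
    GR_OS_TEMP,             /* gr68-79  kt0-kt11                     */
    GR_OS_STATIC,           /* gr80-92  ks0-ks12                     */
    GR_FP_STATIC,           /* gr93-95  ks13-ks15                    */
    GR_CALLING_CONVENTION,  /* gr96-127 interrupted program context  */
    GR_OTHER                /* not listed in the table               */
};

enum gr_class classify_global(unsigned gr)
{
    if (gr == 1)                return GR_STACK_POINTER;
    if (gr >= 64 && gr <= 67)   return GR_INT_TEMP;
    if (gr >= 68 && gr <= 79)   return GR_OS_TEMP;
    if (gr >= 80 && gr <= 92)   return GR_OS_STATIC;
    if (gr >= 93 && gr <= 95)   return GR_FP_STATIC;
    if (gr >= 96 && gr <= 127)  return GR_CALLING_CONVENTION;
    return GR_OTHER;
}

/* True when a handler may overwrite the register without saving it
 * first, under the allocation scheme of the table. */
bool handler_may_clobber(unsigned gr)
{
    return classify_global(gr) == GR_INT_TEMP;
}
```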

Existing floating-point trap handlers use gr64-gr78 as temporary registers,
with interrupts disabled. In addition, registers gr93-gr95 are used to hold static
variables for these routines. The register assignments in these routines can easily be
changed, but fifteen temporary global registers and three static global registers must
be allocated. Note, with the Am29050 processor, only the integer divide instructions
are not directly supported by the processor hardware and require trapware support.
This requires six temporary global registers and no static global registers.

If all of the local registers are given over to User-mode code use, then interrupt
and trap handlers must also assume that the local registers are being used and may not
be arbitrarily rewritten, unless the values they contain are saved upon entry, and are
restored prior to exit. If a cache window size (rfb-rab) less than the physical register
file size is used, then a number of non-static temporary local registers can be made
available for handler use.

Fortunately, most interrupt handlers can operate very efficiently using only a
few temporary registers. It is recommended that global registers gr64-gr67 (it0-it3)
be allocated for this purpose. However, additional temporary registers kt0-kt3 may
be used for interrupt handlers if these registers are not used by the operating system.

4.3.2  Interrupt  Latency 

The  determination  of  the  number  of  cycles  required  to  reach  the  first  instruction 
of  an  interrupt  or  trap  handler  is  a  little  complicated.  First  consider  the  case  for  the 
non-vector  fetch,  table  of  handlers  method. 

An external interrupt line may have to be held active for one cycle before the
processor internally recognizes it. Once recognized, one cycle is required to
internally synchronize the processor. Now any in-progress load or store must be
completed (Dc cycles, where 0 <= Dc <= Dw; note that Dw is the number of cycles
required to complete a data memory write and is often greater than Dr, the number
of cycles required to complete a data memory read). One cycle is then required to
calculate the vector. The first instruction can then be fetched (Ir cycles) and
presented to the instruction fetch unit. One cycle is required by the fetch unit and
a further cycle by the decode unit before the instruction reaches execute. If the
first instruction is found in the cache, then the Branch Target Cache memory
forwards the instruction directly to the decode unit. The total latency (minimum of
five cycles for the hit case) is given by the equations below.

    delay(miss) = 1 + 1 + Dc + 1 + Ir + 1 + 1

    delay(hit)  = 1 + 1 + Dc + 1 + 1 + 1
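
As a quick check on the arithmetic, the two expressions can be evaluated for a
particular memory system. The sketch below is an illustration only, written in
Python rather than 29K assembly; the function names are assumptions and not part of
any AMD tool. Dc and Ir must be supplied from the timing of the target hardware.

```python
def delay_miss(dc, ir):
    """Table-of-handlers latency when the first handler instruction
    must be fetched from instruction memory (Ir cycles)."""
    return 1 + 1 + dc + 1 + ir + 1 + 1


def delay_hit(dc):
    """Table-of-handlers latency when the first handler instruction
    is supplied by the Branch Target Cache."""
    return 1 + 1 + dc + 1 + 1 + 1


if __name__ == "__main__":
    # No load or store in progress: the five-cycle minimum for a hit.
    print(delay_hit(0))
    # A two-cycle store in progress and a two-cycle instruction fetch.
    print(delay_miss(2, 2))
```

With no load or store in progress (Dc = 0) the hit case gives the five-cycle minimum
quoted in the text.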

Now let’s consider the case for a table of vectors, that is, the VF bit in the CFG
register is set (always the case for 2-bus processors and microcontrollers). The
vector must still be calculated and any in-progress load or store completed before
the vector can be fetched from data memory. Additionally, if the processor has a data
cache, the cache state is synchronized after any current data access is completed.
Data cache synchronizing is discussed in detail at the end of this section. The number
of cycles required to read the data memory is represented by Dr. Once the address of
the handler has been fetched it must be routed to the processor PC; this takes one
cycle. A further cycle occurs before the address reaches the Address Pins. Delays
involved in fetching the first instruction are then the same as described above. Once
again, if the first instruction is found in the cache, the Branch Target Cache memory
forwards the instruction directly to the decode unit. The total latency (minimum of
seven cycles for the hit case) is given by the equations below.

    delay(miss) = 1 + 1 + Dc + <cache sync.> + 1 + Dr + 1 + 1 + Ir + 1 + 1

    delay(hit)  = 1 + 1 + Dc + <cache sync.> + 1 + Dr + 1 + 1 + 1
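
The vector-fetch expressions can be evaluated in the same way. The following Python
sketch is only an illustration of the arithmetic; the function names and the
cache_sync parameter (standing for the <cache sync.> term, zero when there is no
data cache) are assumptions made for the example.

```python
def vf_delay_miss(dc, dr, ir, cache_sync=0):
    """Vector-fetch latency when the first handler instruction
    must be fetched from instruction memory."""
    return 1 + 1 + dc + cache_sync + 1 + dr + 1 + 1 + ir + 1 + 1


def vf_delay_hit(dc, dr, cache_sync=0):
    """Vector-fetch latency when the first handler instruction
    is supplied by the Branch Target Cache."""
    return 1 + 1 + dc + cache_sync + 1 + dr + 1 + 1 + 1


if __name__ == "__main__":
    # Single-cycle data read, nothing in progress, no data cache.
    print(vf_delay_hit(dc=0, dr=1))
    print(vf_delay_miss(dc=0, dr=1, ir=1))
```

With Dc = 0, Dr = 1 and no cache synchronizing, the hit case gives the seven-cycle
minimum quoted in the text.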

The Am29050 processor supports instruction forwarding. This enables
instructions to be forwarded directly to the decode unit, bypassing the fetch unit and
saving one cycle. The minimum latency for the Am29050 processor for the vector fetch
and non-vector fetch methods is six cycles and four cycles, respectively.

The Am29040 and Am2924x processors have a data cache which can add to
interrupt latency. Consider that the Am29240 has a two-word write buffer which must
be flushed before interrupt processing can be completed. This adds as much as
2 x Dw cycles to interrupt latency. The processor could be performing a load when
interrupted. If the load caused a block (cache entry) to be allocated, then the load
would be completed but block allocation canceled.

Cache synchronizing for the Am29040 processor is a little more complicated.
The worst case condition occurs when the write buffer is full and a load is performed.
The load can cause block allocation and, because of the write-back policy, the
selected block may have to be copied back. The Am29040 always flushes the write
buffer before reloading a new block. Cache reload cannot be canceled even if the
interrupt occurs before the write buffer is flushed. However, the loaded block will be
held in the reload buffer (see Figure 5-9) and the copy-back buffer returned to the
cache. Unfortunately, the reload buffer contents will never make it into the cache.

The effects of data cache synchronizing on interrupt latency are summarized
below:

    Am29240   <cache sync.> = 2 x Dw

    Am29040   <cache sync.> = (2 x Dw) + (4 x Dr)
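
The worst-case synchronizing term can be tabulated for a given memory system. The
Python sketch below simply encodes the two expressions above; the function name is
an assumption, and returning zero for any other processor name is an assumption
that the part has no data cache.

```python
def cache_sync_cycles(processor, dw, dr):
    """Worst-case <cache sync.> cycles for the processors listed above.

    dw: cycles to complete a data memory write.
    dr: cycles to complete a data memory read.
    """
    if processor == "Am29240":
        # Two-word write buffer must be flushed.
        return 2 * dw
    if processor == "Am29040":
        # Write buffer flush plus a four-word term from the expression above.
        return (2 * dw) + (4 * dr)
    # Assumption: any other name is a processor without a data cache.
    return 0


if __name__ == "__main__":
    for name in ("Am29240", "Am29040"):
        print(name, cache_sync_cycles(name, dw=2, dr=1))
```

The result can be added to the latency expressions given earlier in this section.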

4.3.3  Simple  Freeze-mode  Handlers 

The simplest interrupt or trap handler will execute in its entirety in Supervisor
mode, with interrupts disabled, and with the FZ bit set in the CPS register. This
corresponds to the first stage depicted in Figure 4-1.

The FZ bit in the Current Processor Status register is responsible for locking the
values in the Program Counters (PC0-PC2), the Channel registers (CHA, CHD and
CHC), and the ALU status. As long as the FZ bit remains set, these registers will not
be updated. Note that the PC0-PC2 registers are not the actual Program Counter, but a
three-stage buffer store that records the stages of program execution.

If the intention is to ignore the interrupt and return control to the interrupted
process, the entire handler can consist of little more than an IRET instruction. After
the interrupt request has been cleared, execution of this instruction will cause the
processor to perform the interrupt return sequence described above, resuming
execution of the interrupted program at the point of interruption.

4.3.4  Operating  in  Freeze  mode 

Interrupt or trap handlers executing only a small number of instructions before
returning will benefit from the very short latency of the interrupt sequence performed
by the 29K processor. This is because the 29K processor offers superior performance
when compared with conventional processors that save a great deal of context
whenever an interrupt or trap occurs.

Because the executing program’s context is often not disturbed by the interrupt
or trap handling code, both the reaction time (latency) and processing time of the
interrupt handler are minimized.

In this context, no registers (except the CPS) have been saved when an interrupt
or trap handler is given control by the processor. In addition, if the Program Counter
registers (PC0 and PC1) are left undisturbed, the 29K processor’s instruction
pipeline is more quickly restarted when the handler returns.

But, because Freeze mode has frozen the contents of several important registers,
there are some instructions that should not be used in this context, or whose use is
restricted. These instructions are:

■  Instructions  that  can  generate  traps.  These  should  not  be  used  because  traps  are 
disabled  in  Freeze  mode.  These  include  ASSERT,  emulated  floating-point 
operations  (e.g.,  FADD),  and  certain  integer  operations  whose  execution 
could  cause  a  trap  to  occur.  Note,  the  Am29050  processor  executes  all  floating 
point  operations  directly  and  thus  these  instructions  can  be  used  with  the 
Am29050  processor  as  they  will  not  generate  a  trap. 

If a trap generating instruction is executed it will have the same effect as a NOP
instruction. The exception is a trap caused by a bad memory access: such traps are
always taken, even if they occur in Freeze-mode code. Because the processor registers
were already frozen at the time of the nested trap, it can be difficult to determine the
cause of the trap or issue an IRET instruction.

However, if an Am29050 processor is being used and a trap occurs when the DA
bit is set in the CPS register, Monitor mode is entered. Monitor mode (section 4.3.5)
can be used by monitors to debug kernel Freeze-mode code.

■  Instructions that use special registers: these instructions may be used;
however, any modified registers may have to be saved and restored before the
interrupt handler returns. The EXTRACT and INSERT instructions are in this
category.

■  Instructions that modify special registers: because of the normal side effect of
their operation, these instructions must be used with caution. There are three
subgroups within this group:

— Arithmetic  and  logical  instructions  that  set  the  Z,  N,  V,  and  C  status  bits  in  the 
ALU  Status  register.  These  instructions  can  be  used  in  Freeze  mode  if  the  ALU 
status  bits  are  not  used.  Because  Freeze  mode  disables  updating  the  ALU  Status 
register,  extended  precision  arithmetic  instructions,  such  as  ADDC  or  SUBC, 
will  not  execute  properly. 

— Load-Multiple  and  Store  Multiple.  These  instructions  cannot  be  used  in  Freeze 
mode,  because  the  Channel  registers  (CHA,  CHD,  and  CHC)  upon  which  their 
execution  depends  are  frozen. 

— LOAD  and  STORE  instructions  with  the  set  BP  option  enabled,  if  the  Data 
Width  Enable  (DW  bit)  is  0.  In  this  case,  if  BP  must  be  set,  it  will  have  to 
be  done  explicitly  by  using  a  Move-To-Special  Register  (MTSR)  instruction. 
Therefore,  LOAD  and  STORE  instructions  with  word-aligned  addresses  (i.e., 
those  whose  least  significant  2  bits  are  0)  may  be  used  without  additional  effort; 
however,  if  byte  or  half-word  instructions  are  needed,  the  BP  register  must  be 
explicitly  set  prior  to  execution  of  a  non-word-aligned  LOAD,  STORE, 
INSERT,  or  EXTRACT  instruction. 

All other instructions may be used without restriction, keeping in mind the
inherent implications of Freeze mode. (Note: Other restrictions apply to Am29000
processors manufactured prior to revision C.)
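
The restriction on extended-precision arithmetic can be seen with a small model. The
Python sketch below is a toy illustration, not a description of the processor’s
internal logic: the class ToyAlu and the function add64 are hypothetical names, and
the stale carry value is assumed to be 0. It shows a 64-bit addition built from an
add followed by an add-with-carry, first with the carry bit updating normally and
then with the carry bit frozen.

```python
MASK32 = 0xFFFFFFFF


class ToyAlu:
    """Toy 32-bit adder with a carry bit that may be frozen."""

    def __init__(self, frozen):
        self.frozen = frozen
        self.carry = 0  # assumption: stale carry left by interrupted code

    def _update(self, total):
        # A frozen ALU status register is not updated by the operation.
        if not self.frozen:
            self.carry = total >> 32
        return total & MASK32

    def add(self, a, b):
        return self._update(a + b)

    def addc(self, a, b):
        return self._update(a + b + self.carry)


def add64(alu, a, b):
    """64-bit add built from a 32-bit add and an add-with-carry."""
    low = alu.add(a & MASK32, b & MASK32)
    high = alu.addc(a >> 32, b >> 32)
    return (high << 32) | low


if __name__ == "__main__":
    print(hex(add64(ToyAlu(frozen=False), 0xFFFFFFFF, 1)))
    print(hex(add64(ToyAlu(frozen=True), 0xFFFFFFFF, 1)))
```

In the frozen case the carry out of the low word never reaches the high word, so the
result is wrong whenever the low-order addition overflows.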

4.3.5  Monitor  mode 

Monitor mode only applies to the Am29050 processor. If a trap occurs when the
DA bit in the CPS register is a 1, the processor starts executing at address 16 in
instruction ROM space. Monitor mode is not entered as a result of asynchronous
events such as timer interrupts or activation of the TRAP(1-0) or INTR(3-0) lines.

On taking a Monitor mode trap the Reason Vector register (RSN) is set to
indicate the cause of the trap. Additionally, the MM bit in the CPS register is set
to 1. When the MM bit is set, the shadow program counters (SPC0, SPC1, and SPC2)
are frozen, in a similar way to the FZ bit freezing the PC0-PC2 registers. Because the
shadow program counters continue to record PC-bus activity when the FZ bit is set,
they can be used to restart Freeze mode execution. This is achieved by an IRET or
IRETINV instruction being executed while in Monitor mode.

Because Monitor mode traps are used by monitors in the debugging of trap and
interrupt handlers and are not intended for operating system use, they are dealt with
further in Chapter 7 (Software Debugging).

4.3.6  Freeze-mode  Clock  Interrupt  Handler 

The code shown in this example illustrates one way to program a clock that
keeps the current time. One important aspect of this routine is the need to minimize
overhead in the function, taking as little time as possible to update the clock when an
interrupt occurs. Allocating two Operating System Static registers (ks1, ks2) to
contain millisecond and second values reduces the need to access data memory inside
the handler.

; freeze mode clock interrupt handler
;
        .equ    IN, 0x2000000           ;IN-bit of TMR reg
        .reg    CLOCK, ks1              ;1 ms increments
        .reg    SECS, ks2               ;time in seconds
        .equ    CPUCLK, 25              ;CPU clock in MHz
        .equ    RATE, 1000              ;ints per second

intr14:
        const   it0, IN                 ;IN-bit in TMR
        consth  it0, IN
        mfsr    it1, tmr
        andn    it1, it1, it0           ;clear IN-bit
        mtsr    tmr, it1
        const   it0, RATE
        cplt    it0, CLOCK, it0         ;check if 1 sec
        jmpf    it0, carry              ;jump if CLOCK >= RATE
        add     CLOCK, CLOCK, 1         ;increment CLOCK
        iret
carry:
        const   CLOCK, 0
        add     SECS, SECS, 1           ;increment seconds
        iret

This handler executes once each time an interrupt from the on-board timer
occurs. In the preceding code, timer interrupts are assumed to occur once each
millisecond; therefore the value in the CLOCK register will increment 1000 times in
one second. When the 1000th interrupt occurs, the CLOCK register is set to 0, and the
SECS variable is incremented.
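
The behavior described above can be modeled in a few lines. The Python sketch below
follows the description in the text (roll over on the 1000th interrupt) and is not an
instruction-by-instruction transcription of the handler; the class name SoftClock is
an assumption made for the example.

```python
RATE = 1000  # timer interrupts per second, as in the handler listing


class SoftClock:
    """Model of the CLOCK (milliseconds) and SECS (seconds) counters."""

    def __init__(self):
        self.clock = 0
        self.secs = 0

    def tick(self):
        """Called once per timer interrupt."""
        self.clock += 1
        if self.clock >= RATE:
            # One second has elapsed: restart the millisecond count.
            self.clock = 0
            self.secs += 1


if __name__ == "__main__":
    clk = SoftClock()
    for _ in range(2500):
        clk.tick()
    print(clk.secs, clk.clock)
```

Keeping both counters in registers means the common path, a plain increment, never
touches data memory.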

The  29K  processor  Timer  Counter  register  includes  a  24-bit  Timer  Count  Value 
(TCV)  field  that  is  automatically  decremented  on  every  processor  cycle.  When  the 
TCV  field  decrements  to  0,  it  is  written  with  the  Timer  Reload  Value  (TRV)  field  of 
the  Timer  Reload  (TMR)  register  on  the  next  cycle.  The  Interrupt  (IN)  bit 
of  the  TMR  register  is  set  at  the  same  time.  The  following  code  illustrates  a  technique 
to  initialize  the  timer  for  this  purpose.

; freeze mode clock interrupt initialization
;
        .equ    TICKS, (CPUCLK*1000000/RATE)
        .equ    IE, 0x1000000           ;IE-bit in TMR reg

it:
        const   it0, TICKS              ;i.e., 25,000
        consth  it0, TICKS
        mtsr    tmc, it0                ;set counter value
        const   it0, (IE|TICKS)         ;value+int.-enable
        consth  it0, (IE|TICKS)
        mtsr    tmr, it0                ;set reload value
        const   SECS, 0                 ;set seconds=0
        jmpi    lr0
        const   CLOCK, 0                ;set clock=0

Assuming  the  processor  is  running  at  25  MHz,  setting  the  timer  reload  and 
count  values  to  25000  causes  the  count  to  decrement  to  0  once  each  millisecond.  This 
will  accumulate  1000  counts  during  one  second  of  CPU  execution.  If  two  Operating 
System  Static  registers  can  not  be  spared  for  this  purpose,  the  SECS  variable  should 
be  located  in  main  memory.  The  modified  code  for  incrementing  the  seconds  counter 
in  memory  is  shown  below. 


SECS:   .word   0

carry:
        const   it0, SECS
        consth  it0, SECS
        load    0, 0, it1, it0
        add     it1, it1, 1
        const   CLOCK, 0
        store   0, 0, it1, it0
        iret


Because  the  SECS  variable  is  only  referenced  once  per  second,  the  performance 
degradation  due  to  this  change  would  be  minimal.  The  initialization  code  would  also 
need  to  be  modified  to  set  the  memory  location  for  SECS  to  0  in  this  case. 
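
The reload arithmetic used by the initialization code can be checked for other clock
speeds and interrupt rates. The Python sketch below is an illustration only; the
function names are assumptions, while the IE value and the 24-bit field width are
taken from the listing and text above.

```python
IE = 0x1000000        # IE-bit in TMR, as in the initialization listing
FIELD_MASK = 0xFFFFFF  # count and reload values are 24-bit fields


def timer_ticks(cpu_mhz, ints_per_second):
    """Processor cycles between timer interrupts (the TICKS constant)."""
    ticks = cpu_mhz * 1_000_000 // ints_per_second
    if not 0 < ticks <= FIELD_MASK:
        raise ValueError("reload value does not fit in the 24-bit field")
    return ticks


def tmr_value(cpu_mhz, ints_per_second):
    """Word written to TMR: reload value with the interrupt enable bit."""
    return IE | timer_ticks(cpu_mhz, ints_per_second)


if __name__ == "__main__":
    print(timer_ticks(25, 1000))
    print(hex(tmr_value(25, 1000)))
```

The range check matters at low interrupt rates: at 25 MHz, one interrupt per second
would need a count of 25,000,000, which does not fit in 24 bits.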

4.3.7  Removing  Freeze  mode 

Some interrupt handlers will benefit from removing Freeze mode, without
enabling interrupts, in order to use the load-multiple and store-multiple instructions.
A less common reason for removing Freeze mode is the ability to use the ALU Status
bits: V, N, Z, and C. In either case, several registers must be saved before the
Freeze-mode bit in the CPS register can be cleared.

The  removal  of  Freeze  mode  represents  entry  into  the  second  stage  of  interrupt 
handling,  as  shown  in  Figure  4-1. 

The frozen Program Counters (PC0 and PC1) must be saved so that the handler
will be able to resume execution of the interrupted program. If external data memory
is to be accessed, the CHA, CHD and CHC Channel registers must be saved so that
their contents can be restored after a load- or store-multiple instruction has been
executed. Saving the channel registers also saves the Count Remaining register,
which is contained within the CHC register. Additionally, before any ALU/Logical
operations are performed, the ALU register must be saved.

After the Program Counters have been saved and before any Channel or ALU
operation is executed, Freeze mode can be removed by clearing the Freeze (FZ) bit of
the CPS register. This immediately removes the freeze condition, and all registers,
including the Program Counters, will update normally. The PC0 register shall reflect
the PC-BUS activity on the cycle following the clearing of Freeze mode. One cycle
later, the PC1 register shall begin to reflect the PC-BUS activity for the current
execution stream. Other registers will only be updated when the relevant instructions
are performed (as described above).

The  primary  benefit  of  leaving  Freeze  mode  is  the  ability  to  use  the  load-  and 
store-multiple  instructions.  After  Freeze  mode  has  been  exited,  the  DA  bit  in  the  CPS 
register  is  still  set  and  instructions  causing  traps  should  not  be  used.  Thus,  many  of 
the  restrictions  listed  in  the  section  titled  Operating  in  Freeze  mode  (section  4.3.4) 
will  still  apply,  with  the  additional  requirement  that  several  of  the  interrupt  temporary 
global  registers  will  be  needed  to  hold  the  saved  registers. 

An  example  of  code  that  implements  removing  Freeze  mode  is  shown  below. 

; Removing Freeze mode example code
;
        .equ    FZ, 0x00000400          ;FZ-bit in CPS
        .equ    SM, 0x00000010          ;SM-bit in CPS
        .equ    PD, 0x00000040          ;PD-bit in CPS
        .equ    PI, 0x00000020          ;PI-bit in CPS
        .equ    DI, 0x00000002          ;DI-bit in CPS
        .equ    DA, 0x00000001          ;DA-bit in CPS
        .equ    REMOVE, (SM|PD|PI|DI|DA)
        .equ    FREEZE, (REMOVE|FZ)

                                        ;interrupt vector points here
        mfsr    it0, pc0                ;save PC0
        mfsr    it1, pc1                ;save PC1
        mtsrim  cps, REMOVE             ;remove Freeze mode
        mfsr    it3, alu                ;save ALU
        mfsr    kt0, cha                ;save CHA
        mfsr    kt1, chd                ;save CHD
        mfsr    kt2, chc                ;save CHC
;
;       interrupt handler code goes here
;
        mtsr    chc, kt2                ;restore CHC
        mtsr    chd, kt1                ;restore CHD
        mtsr    cha, kt0                ;restore CHA
        mtsr    alu, it3                ;restore ALU
        mtsrim  cps, FREEZE             ;set Freeze mode
        mtsr    pc1, it1                ;restore PC1
        mtsr    pc0, it0                ;restore PC0
        iret

The example code begins by saving the Program Counters (PC0 and PC1),
using MFSR instructions to move the values from special registers to temporary
global registers it0 and it1.

Freeze mode is then disabled by clearing the FZ bit in the CPS register. (Note that
the bits set by the MTSRIM instruction are system implementation dependent; the RE
bit may be required.) Once Freeze mode is turned off, the ALU register will be
modified by any ALU/Logical operation. Thus, it is important that the ALU register be
saved now. (Note that two processor cycles are needed, after Freeze mode is removed,
to allow the program state to properly update the program counters.)
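
Because the value written by MTSRIM replaces the CPS contents, it is worth checking
exactly which bits the REMOVE and FREEZE constants contain before adapting them to a
particular system. The Python sketch below recomputes the two constants from the bit
values given in the listing; the helper names cps_mask and bits_set are assumptions
made for the example.

```python
# CPS bit values as defined in the example listing.
CPS_BITS = {
    "FZ": 0x00000400,
    "PD": 0x00000040,
    "PI": 0x00000020,
    "SM": 0x00000010,
    "DI": 0x00000002,
    "DA": 0x00000001,
}


def cps_mask(*names):
    """OR together the named CPS bits."""
    value = 0
    for name in names:
        value |= CPS_BITS[name]
    return value


def bits_set(value):
    """List the names of the known CPS bits that are set in value."""
    return [name for name, bit in CPS_BITS.items() if value & bit]


REMOVE = cps_mask("SM", "PD", "PI", "DI", "DA")
FREEZE = REMOVE | CPS_BITS["FZ"]

if __name__ == "__main__":
    print(hex(REMOVE), bits_set(REMOVE))
    print(hex(FREEZE), bits_set(FREEZE))
```

The two constants differ only in the FZ bit, so switching between them changes the
freeze state and nothing else.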

If  interrupts  are  not  to  be  re-enabled  and  the  kernel  does  not  require  the  use  of 
global  registers  ( kt0-kt2 ),  then  these  registers  can  be  used  to  extend  the  number  of 
available  interrupt  temporary  registers. 

The ALU register is saved in temporary kernel register it3. The Channel
registers (CHA, CHD and CHC) are then saved in operating system temporary registers
kt0-kt2.

The interrupt handler is still executing with interrupts disabled at this point in
the program, but load- and store-multiple instructions can be freely used, as long as
they do not cause another interrupt or trap to occur. Note that even with the DA bit in
the CPS register set, certain traps such as a Data Access Exception can still be taken.
When the handler is finished, it must reverse the process by restoring all the saved
registers. No particular order of instructions is necessary, as long as Freeze mode is
entered before PC1 and PC0 are restored. Additionally, instructions affecting the
ALU register must not be used after the saved value has been restored. By restoring
the ALU register after Freeze mode is entered, instructions are prevented from
affecting the ALU register.

When  the  IRET  instruction  is  executed,  the  restored  Program  Counters 
(PC0-PC1)  are  used  to  resume  the  interrupted  program.  The  restored  CPS  (saved  in 
OPS  by  the  CPU)  and  Channel  register  contents  are  used  to  restart  any  unfinished 
operations. 

If enough global registers are not available for saving the Program Counters and
Channel registers, memory could be used for this purpose. In this case, six words of
memory are needed. Example code for saving and restoring the registers in the user’s
memory stack is shown below. Note that the pushsr and popsr macro instructions, first
introduced in section 3.3.1 (page 119), are used in the example code and are presented
again below:


        .macro  pushsr, sp, reg, sreg
        mfsr    reg, sreg
        sub     sp, sp, 4
        store   0, 0, reg, sp
        .endm
;
        .macro  popsr, sreg, reg, sp
        load    0, 0, reg, sp
        add     sp, sp, 4
        mtsr    sreg, reg
        .endm

; save registers on memory stack
;
        pushsr  msp, it0, pc0           ;save PC0
        pushsr  msp, it0, pc1           ;save PC1
        pushsr  msp, it0, alu           ;save ALU
        pushsr  msp, it0, cha           ;save CHA
        pushsr  msp, it0, chd           ;save CHD
        pushsr  msp, it0, chc           ;save CHC
;
        const   it3, FZ
        mfsr    it2, cps
        andn    it2, it2, it3
        mtsr    cps, it2                ;remove Freeze mode
;
; The interrupt handler code goes here
;
        const   it3, FZ
        mfsr    it2, cps
        or      it2, it2, it3
        mtsr    cps, it2                ;set Freeze mode
        popsr   chc, it0, msp           ;restore CHC
        popsr   chd, it0, msp           ;restore CHD
        popsr   cha, it0, msp           ;restore CHA
        popsr   alu, it0, msp           ;restore ALU
        popsr   pc1, it0, msp           ;restore PC1
        popsr   pc0, it0, msp           ;restore PC0
        iret

The  previous  code  can  be  made  more  efficient  by  saving  more  registers  at  a  time, 
at  the  expense  of  using  a  greater  number  of  global  registers.  Using  store-multiple 
instructions  to  save  the  registers’  contents  takes  advantage  of  Burst  mode  in  the 
processor  memory  system. 
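
The save and restore sequences above rely on the memory stack being last-in,
first-out: registers must be popped in exactly the reverse of the order in which they
were pushed. The Python sketch below models the pushsr and popsr macros to make that
ordering explicit. The class MemoryStack, the starting stack address, and the register
values are assumptions made for the example.

```python
SAVE_ORDER = ["pc0", "pc1", "alu", "cha", "chd", "chc"]


class MemoryStack:
    """Model of a descending memory stack addressed through msp."""

    def __init__(self, top=0x1000):
        self.msp = top
        self.mem = {}

    def pushsr(self, value):
        # sub sp,sp,4 followed by store
        self.msp -= 4
        self.mem[self.msp] = value

    def popsr(self):
        # load followed by add sp,sp,4
        value = self.mem[self.msp]
        self.msp += 4
        return value


def save(stack, regs):
    for name in SAVE_ORDER:
        stack.pushsr(regs[name])


def restore(stack):
    # Pop in the reverse of the push order.
    return {name: stack.popsr() for name in reversed(SAVE_ORDER)}


if __name__ == "__main__":
    regs = {name: index for index, name in enumerate(SAVE_ORDER)}
    stack = MemoryStack()
    save(stack, regs)
    print(hex(stack.msp))          # six words below the starting address
    print(restore(stack) == regs)
```

Six registers occupy six words, matching the memory requirement stated in the text.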

4.3.8  Handling  Nested  Interrupts 

Handling  Nested  Interrupts  is  a  complex  topic,  and  the  method  presented  in  this 
section  discusses  multiple  levels  of  interrupt  nesting  [Mann  1992b].  Two  methods 
are  presented.  The  first  method  results  in  an  interrupt  mechanism  similar  to  the 
interrupt  scheme  used  by  some  CISC  microprocessors.  The  second  method  takes 
advantage  of  the  29K  family  RISC  architecture,  and  offers  better  performance.  The 
following  section,  titled  An  Interrupt  Queuing  Model  (section  4.3.10),  provides  an 
alternative  solution  to  the  problem  that  offers  better  interrupt  processing  throughput. 

For any interrupt handler taking a significant amount of time to execute, it is
usually important to permit interrupts of a higher priority to occur. This keeps the
latency of higher priority interrupts within acceptable limits. Whenever an interrupt
is allowed to preempt the execution of another interrupt handler, the interrupts are
said to be “nested.” That is, execution of the lower priority handler is interrupted, and
the higher priority handler begins execution immediately.

To allow for nested interrupts, it is only necessary to save the registers or
temporary variables that could be overwritten by a new interrupt handler’s context.
As in the previous example, the program counters (PC0 and PC1) and channel
registers (CHA, CHD, and CHC) need to be saved. In addition, because more than
one execution thread may need to be restarted, the Old Processor Status (OPS) and
ALU registers must be saved.

Because an interrupt may occur immediately after being enabled, it is important
that the PC0 and PC1 registers reflect the activity of the current execution PC-BUS.
As already described in the Removing Freeze Mode section, a two cycle delay occurs
before the PC1 register starts updating. Thus Freeze mode must be removed two
cycles before interrupts are enabled.

If  the  interrupt  handler  intends  to  use  integer  multiply  or  divide  instructions  or 
emulated  floating  point  instructions,  the  contents  of  the  Indirect  Pointers  (IPA,  IPB 
and  IPC)  and  the  Q  register  should  also  be  saved.  Before  interrupts  are  enabled,  it  is 
also  important  to  clear  the  CHC  register,  so  that  incomplete  load-  or  store-multiple 
instructions  are  not  restarted  when  the  first  interrupt  return  (IRET)  instruction  is 
executed.  Figure  4-3  illustrates  the  context  in  which  this  could  lead  to  unfortunate 
results. 

In Figure 4-3, execution of a load-multiple instruction in the main program is in
progress when an external interrupt occurs. This results in control being given to a
first-level interrupt handler. The handler enables interrupts, and another interrupt
occurs (e.g., a Timer Interrupt). When this happens, the second-level interrupt handler
is given control.

After completing its processing, execution of an IRET instruction causes the
processor to use the information in its CHC register to resume the interrupted
load-multiple instruction; but this is in the context of the first-level interrupt
handler, rather than in the main program where it was interrupted.

This  CHC  discussion  is  merely  an  explanation  to  stress  that  the  CHC  register 
should  not  only  be  saved  and  restored  in  each  interrupt  level,  but  that  CHC  should 
also  be  cleared  before  interrupts  are  enabled.  This  will  ensure  that  only  when  the 
proper  copy  of  the  CHC  is  restored  will  execution  of  an  IRET  instruction  restart  the 
interrupted  load-  or  store-multiple  operation. 

A problem relating to clearing the CHC register has been observed with a
number of 29K family members. The problem affects the last word of a LOADM
instruction reaching its destination register when the LOADM is interrupted. The
problem can be overcome by performing a LOADM or STOREM instruction in the
interrupt handler after coming off Freeze mode but before reenabling interrupts. The
LOADM or STOREM must use a CR value of one or greater. Processors have a
hidden internal shadow CHC which may not be cleared by a move of zero into the
CHC register. A LOADM or STOREM instruction causes the hidden CHC register to
be cleared. The problem can also be overcome by performing a STORE or LOAD
instruction while still in Freeze mode. If interrupts are not reenabled by the interrupt
handler, no special steps are required to deal with the interrupted LOADM difficulty.
The problem is of little importance, as interrupt handlers generally perform the
solutions described without additional code being added.

Additionally,  when  a  trap  occurs  as  a  result  of  a  Data  Exception  Error  (DERR) 
the  TF  bit  in  the  CHC  will  become  set.  It  is  important  that  the  CHC  register  be  cleared 
rather  than  be  restored  for  the  context  containing  the  violating  data  access.  Otherwise 
an  interrupt  handler  loop  will  result. 
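
The CHC hazard described in this section can be traced with a very small model. The
Python sketch below is a toy, not a description of the processor’s channel logic: the
CHC register is reduced to a single value naming the context that owns an unfinished
load-multiple (or None when empty), and the function name run_nested is an
assumption. It records where an IRET would restart the operation, with and without
clearing CHC before interrupts are enabled.

```python
def run_nested(clear_chc_before_enable):
    """Trace IRET restarts for the scenario of Figure 4-3.

    Returns a list of (context returned to, owner of the restarted
    load-multiple) pairs, one per restart.
    """
    restarts = []

    # A load-multiple in the main program is interrupted.
    chc = "main"

    # First-level handler: save CHC, optionally clear it, enable interrupts.
    saved_level1 = chc
    if clear_chc_before_enable:
        chc = None

    # Second-level interrupt: its handler saves and later restores CHC.
    saved_level2 = chc
    chc = saved_level2

    # IRET from the second level returns into the first-level handler.
    if chc is not None:
        restarts.append(("first-level handler", chc))

    # First-level handler restores its saved CHC and returns to main.
    chc = saved_level1
    if chc is not None:
        restarts.append(("main", chc))

    return restarts


if __name__ == "__main__":
    print(run_nested(clear_chc_before_enable=True))
    print(run_nested(clear_chc_before_enable=False))
```

With CHC cleared, the only restart happens on the return to the main program, where
the operation belongs. Without clearing, the model also shows a restart inside the
first-level handler, which is the unfortunate result the text warns about.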

4.3.9  Saving  Registers 

The  following  code  illustrates  saving  the  necessary  registers,  turning  off  Freeze 
mode,  and  enabling  interrupts. 


; multi-level nested interrupt handler
; example code
;
intr0:                                  ;save registers
        pushsr  msp, it0, pc0           ;save PC0
        pushsr  msp, it0, pc1           ;save PC1
        pushsr  msp, it0, alu           ;save ALU
        pushsr  msp, it0, cha           ;save CHA
        pushsr  msp, it0, chd           ;save CHD
        pushsr  msp, it0, chc           ;save CHC
        pushsr  msp, it0, ops           ;save OPS
;
; come off freeze - could use mtsrim
;
        const   it1, FZ
        mfsr    it0, cps                ;get CPS
        andn    it1, it0, it1           ;clear FZ, keep new CPS in it1
        mtsr    cps, it1                ;remove Freeze mode
;
; more regs while PC0,1 get in step
;
        pushsr  msp, it0, ipa           ;save IPA
        pushsr  msp, it0, ipb           ;save IPB
        pushsr  msp, it0, ipc           ;save IPC
        pushsr  msp, it0, q             ;save Q
        mtsrim  chc, 0                  ;clear CHC
        andn    it1, it1, (DI|DA)       ;clear DI and DA
        mtsr    cps, it1                ;enable interrupts
dispatch:
;
; Interrupt handler code starts here.
; Dispatch to appropriate service routine.
;


Saving the Indirect Pointers and Q register is a user preference, but their
contents are modified by several 29K processor instructions. It is important to bear
this in mind when writing interrupt handlers. The safest approach is to always save the
contents of these registers.

The above code uses a stack to save the register contents, similar to the way a
CISC processor’s microcode saves processor state. However, better performance can
be achieved by use of the large number of processor registers to cache the interrupted
context before having to resort to an interrupt context stack. The following code
performs much the same task as the previous code, but it can reach the interrupt
dispatcher (label dispatch:) in twelve cycles less for the first interrupt and costs only
an additional two cycles for interrupts at greater levels of nesting (assuming MTSRIM
is used to update the CPS register).

This code implements a first-level interrupt context cache in global registers
kt4-kt10. Global register kt11 is used to keep a record of the current level of
interrupt nesting, and should be initialized to -1, that is, cache empty. Considering
the speed of the 29K family, it is likely the first-level interrupt processing will be
complete before a further interrupt occurs, thus avoiding the need to save context on
a memory stack. The use of registers rather than memory to save context also results
in reduced latency between the time the interrupt occurred and the time the
appropriate service routine starts executing.

The example code below does not store the indirect pointer registers (IPA, IPB,
and IPC) or the Q register. These registers need to be saved only by interrupt
handlers which make use of the indirect pointers, use emulated arithmetic
instructions, or use integer multiply or divide. Best performance is achieved by
postponing the saving of these registers to the specific handler routine which
expects to use them. Correspondingly, a handler which uses them is also
responsible for restoring them.

        .equ    Kmode, (PD | PI | SM | IM)
;
not_1st:                            ; save on stack
        pushsr  msp, it0, pc0       ; save PC0
        pushsr  msp, it0, pc1       ; save PC1
        pushsr  msp, it0, alu       ; save ALU
        pushsr  msp, it0, cha       ; save CHA
        pushsr  msp, it0, chd       ; save CHD
        pushsr  msp, it0, chc       ; save CHC
        pushsr  msp, it0, ops       ; save OPS
        jmp     dispatch-8
        mtsrim  cps, REMOVE         ; remove Freeze mode
;
intr0:                              ; save registers
        jmpf    kt11, not_1st       ; test cache in use
        add     kt11, kt11, 1       ; level count
;
cache:                              ; save in cache
        mfsr    kt4, pc0            ; save PC0
        mfsr    kt5, pc1            ; save PC1
        mtsrim  cps, REMOVE         ; remove Freeze mode
        mfsr    kt6, alu            ; save ALU
        mfsr    kt7, cha            ; save CHA
        mfsr    kt8, chd            ; save CHD
        mfsr    kt9, chc            ; save CHC
        mfsr    kt10, ops           ; save OPS
;
        mtsrim  chc, 0              ; clear CHC
        mtsrim  cps, Kmode          ; enable interrupts
;
dispatch:
;
; Interrupt handler code starts here.
; Dispatch to appropriate handler.
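
The nesting logic of the register cache can be modeled in a few lines of Python. This is only an illustrative sketch, not 29K code: the class and method names are hypothetical, `level` plays the role of kt11 (initialized to -1), `cache` stands in for kt4-kt10, and a Python list stands in for the memory stack.

```python
class ContextCache:
    """Toy model of a one-level register cache backed by a memory stack."""

    def __init__(self):
        self.level = -1      # role of kt11: -1 means the cache is empty
        self.cache = None    # role of kt4-kt10
        self.stack = []      # role of the interrupt context stack

    def save(self, context):
        """Save the interrupted context; return where it was stored."""
        first = self.level < 0      # jmpf falls through while kt11 is negative
        self.level += 1             # the delay-slot add always executes
        if first:
            self.cache = context
            return "cache"
        self.stack.append(context)
        return "stack"

    def restore(self):
        """Return the most recently saved context."""
        self.level -= 1
        if self.level < 0:          # back at the outermost level
            context, self.cache = self.cache, None
            return context
        return self.stack.pop()
```

Only the first level of interrupt uses the registers; every deeper level falls back to the stack, which is why the first-level case is the fast one.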

4.3.10  Enabling  Interrupts 

Interrupts are enabled by clearing the DI and DA bits of the CPS. If an
unmasked interrupt, INTR[0..3], is pending at this point (the IP bit of the CPS
register is set to 1), the processor will immediately process the interrupt and
execute the handler at the new vector address.

In the previous code example, if an interrupt occurs once interrupts are
enabled, the succeeding register saves will not be performed; however, the newly
invoked interrupt handler will save these registers if it intends to enable
interrupts during its execution. The contents of the Indirect Pointers and Q
register will be preserved, or not touched, depending on the nature of the
nested interrupt handler.

When clearing the DI and DA bits of the CPS register, the state of the other
bits must be preserved. The first example code illustrates this by using an
ANDN instruction to AND the current contents of the register with the
complement of the DA and DI bit pattern (i.e., 1111 1111 1111 1100).
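
The effect of the ANDN step can be checked with a short Python model. This is a sketch for illustration only: the function names are hypothetical, and the individual bit positions (DA in bit 0, DI in bit 1) are an assumption; only the combined pattern ending in 1100 comes from the text.

```python
DA = 0b01   # assumed position of the DA bit in CPS
DI = 0b10   # assumed position of the DI bit in CPS


def andn(a, b, width=32):
    """Model of ANDN: a AND (NOT b), truncated to `width` bits."""
    mask = (1 << width) - 1
    return a & ~b & mask


def enable_interrupts(cps):
    """Clear DI and DA while leaving every other CPS bit unchanged."""
    return andn(cps, DI | DA)
```

Applying `andn` to an all-ones 16-bit value reproduces the complement pattern quoted above.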

Figure  4-4  shows  the  interrupt  enable  logic  of  the  Am29000  processor.  Notice 
that  interrupts  generated  by  the  on-chip  timer  are  controlled  by  the  DA  bit  in  the  CPS 
register.  This  indicates  it  is  impossible  to  enable  traps  for  use  by  ASSERT  and  other 
instructions,  without  also  permitting  asynchronous  interrupts  from  the  timer  to  occur 
(unless  the  on-chip  timer  is  not  being  used).  If  it  is  necessary  to  avoid  timer  interrupts, 
the  IE  bit  in  the  TMR  register  can  be  saved,  then  cleared  to  disable  timer  interrupts. 

The interrupt inputs to the Prioritizer logic (as shown in Figure 4-4) are not
latched, and must be continuously asserted by an interrupting external device
until the interrupt has been recognized. Recognition of the interrupt is usually
accomplished by executing an instruction that accesses the interrupting device.
This removes the interrupt request, which must be done before interrupts are
enabled; otherwise, the same interrupt will recur immediately when interrupts
are enabled.

The Interrupt Mask (IM) field of the CPS register can be used to disable
recognition of interrupt requests on the INTR inputs. The mask bits implement a
simplified interrupt priority scheme that can be set up to recognize only
higher-priority interrupts while another handler is in execution.

The two-bit IM field allows four priority levels to be established. An IM field
value of zero (IM = 00) enables only the interrupts occurring at the INTR0
input. When IM = 01, both INTR0 and INTR1 are enabled; if IM = 10, then INTR0,
INTR1, and INTR2 are enabled; and if IM = 11, then INTR0, INTR1, INTR2, and
INTR3 are enabled. The only way to disable the INTR0 input is to set the DI
(Disable Interrupts) bit to 1 in the CPS register.
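
The mapping from the IM value to the recognized inputs can be written as a small Python function. This is a sketch for illustration; the function name and the `di` parameter are hypothetical.

```python
def enabled_intr_lines(im, di=0):
    """Return the INTR input numbers recognized for a two-bit IM value."""
    if not 0 <= im <= 3:
        raise ValueError("IM is a two-bit field")
    if di:                       # DI set: no INTR input is recognized
        return []
    return list(range(im + 1))   # INTR0 up to INTR<im>
```

For example, `enabled_intr_lines(1)` gives `[0, 1]`, matching the IM = 01 case.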

An  example  code  fragment  that  sets  the  IM  bits  for  a  handler,  according  to  its 
priority,  is  shown  below. 

; set interrupt mask according to priority, then enable interrupts
;
        .equ    MYLEVEL, 2
        .equ    IM, 0b1100
setim:
        mfsr    it0, cps
        andn    it0, it0, (IM | DI | DA)
        or      it0, it0, ((MYLEVEL-1)<<2)
        mtsr    cps, it0

In the above example, after the CPS has been moved to a global register, the
bits corresponding to the IM field, the DI bit, and the DA bit are cleared by
ANDing them with a complement mask. Next, the value given by the MYLEVEL
definition (decreased by 1) is ORed into the proper position in the IM field,
and the result is stored back into the CPS. With the values shown, the IM field
is set to the value 01, which enables interrupts on INTR0 and INTR1.
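
The same read-modify-write sequence can be traced in Python. This is an illustrative sketch: `set_im` is a hypothetical name, the IM field position (bits 3:2) follows the 0b1100 mask in the listing, and the DI and DA bit positions are assumptions.

```python
IM_MASK = 0b1100   # IM field, bits 3:2
DI = 0b0010        # assumed bit position
DA = 0b0001        # assumed bit position


def set_im(cps, mylevel):
    """Return the CPS value with IM = mylevel - 1 and DI, DA cleared."""
    if not 1 <= mylevel <= 4:
        raise ValueError("mylevel must be in the range 1..4")
    cps &= ~(IM_MASK | DI | DA)    # andn: clear IM, DI and DA
    cps |= (mylevel - 1) << 2      # or: insert the new mask value
    return cps
```

With `mylevel` equal to 2 the IM field becomes 01, and all bits outside the low four are left as they were.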

In the main part of the handler, any Am29000 processor instructions can be
executed; however, because most of the global registers have not been saved, the
handler may not have any extra working space. Depending on the number of
registers needed to carry out the handler’s task, a few additional global
registers may have to be saved, then restored.

4.3.11  Restoring  Saved  Registers 

The final act of an interrupt or trap handler, before executing the IRET
instruction, is to restore the contents of all saved registers so the complete
environment of the interrupted task is restored before execution is resumed.
The proper approach to restoring the saved registers is to reverse the steps
taken to save them.

A specific handler called by the interrupt dispatcher must restore any
additional registers it saved before the generic interrupt return code is
executed. In the case of an external interrupt, it is also important that the
specific handler has cleared the external device causing the interrupt line to
be held active. Otherwise, the processor may be forced into an interrupt handler
loop. Because of internal delays in the processor, the external interrupt must
be cleared at least three cycles before interrupts are enabled. In practice this
requirement is easily met.

At this point, interrupts are still enabled. The last portion of the restoration
process must run with interrupts disabled, because important processor
configuration data is being reloaded, and an interrupt occurring during this
phase could hopelessly confuse the process. The final code fragment is shown
below:

; code to disable interrupts and complete
; the restoration of registers prior to
; issuing an IRET instruction.
;
        popsr   q, it0, msp         ; restore Q
        popsr   ipc, it0, msp       ; IPC
        popsr   ipb, it0, msp       ; IPB
        popsr   ipa, it0, msp       ; IPA
        const   it3, (FZ | DI | DA)
        mfsr    it2, cps            ; disable interrupts
        or      it2, it2, it3       ; and
        mtsr    cps, it2            ; set Freeze mode
        popsr   ops, it0, msp       ; restore OPS
        popsr   chc, it0, msp       ; CHC
        popsr   chd, it0, msp       ; CHD
        popsr   cha, it0, msp       ; CHA
        popsr   alu, it0, msp       ; ALU
        popsr   pc1, it0, msp       ; PC1
        popsr   pc0, it0, msp       ; PC0
        iret

The  interrupt  context  restore  code  for  the  first-level  context  cache  method  is 
shown  below.  Restoring  the  context  from  registers  is  much  faster  than  accessing  an 
external  memory  stack. 


        .equ    DISABLE, (PD | PI | SM | DI | DA)
        .equ    FREEZE, (DISABLE | FZ)
;
        sub     kt11, kt11, 1       ; decrement
        jmpf    kt11, not_1st       ; level counter
        mtsrim  cps, FREEZE         ; disable and Freeze
;
;                                   ; restore from cache
        mtsr    ops, kt10           ; restore OPS
        mtsr    chc, kt9            ; restore CHC
        mtsr    chd, kt8            ; restore CHD
        mtsr    cha, kt7            ; restore CHA
        mtsr    alu, kt6            ; restore ALU
        mtsr    pc1, kt5            ; restore PC1
        mtsr    pc0, kt4            ; restore PC0
        iret
;
;                                   ; restore from stack (jmpf target)
        popsr   ops, it0, msp       ; restore OPS
        popsr   chc, it0, msp       ; restore CHC
        popsr   chd, it0, msp       ; restore CHD
        popsr   cha, it0, msp       ; restore CHA
        popsr   alu, it0, msp       ; restore ALU
        popsr   pc1, it0, msp       ; restore PC1
        popsr   pc0, it0, msp       ; restore PC0
        iret

4.3.12 An Interrupt Queuing Model

One approach to solving the latency demands of a high-performance system is to
simply queue interrupts in a linked list when they occur, and process them in a
higher-level context. Figure 4-5 illustrates the structure and linkages of
individual entries in the example queue. This method results in greater
interrupt processing throughput. Less time is spent executing Freeze mode
context stacking and unstacking when compared with the previously described
nested interrupt handling method.

In the example program, only a few global registers are allocated, because
placing an entry into a global queue is a simple operation.

The  example  code  in  this  section  applies  to  handling  receive  data  interrupts 
from  a  UART  port,  but  several  types  of  interrupts  can  easily  share  the  same  queue. 
For  simplicity,  queue  entries  consist  of  three  words  plus  an  optional  data  block. 

■  Pointer  to  the  next  entry  in  the  queue  (forward  link). 

■  Received  data  count  /  active  flag. 

■  Pointer  to  the  handler  for  this  entry. 

■  An  optional  data  block. 

Once  an  I/O  operation  has  begun  (in  this  case,  reception  of  data  from  a  UART), 
an  interrupt  occurs  for  the  UART  device  and  the  handler  is  called  to  place  a  new  entry 
into  the  global  queue. 

As  each  byte  arrives,  the  first  section  of  the  handler  continues  the  I/O  process, 
often  by  simply  reading  the  data  from  the  UART  and  indicating  that  the  data  has  been 
accepted.  This  causes  the  UART  to  remove  the  interrupt  input  and  prepare  to  receive 
new  data. 


[Figure 4-5. Interrupt Queue Entry Chaining. IRQH (head of queue) points to the
first entry and IRQT (tail of queue) to the second (last) entry. Each entry
holds Next Entry, Count/Active Flag, Pointer to Handler, and Optional Data.]

Only one receive operation for a given interrupt can be in progress at a time.
This allows the queue entry to be a static entry descriptor holding three
things: a pointer to the next entry in the queue, the byte count, and a pointer
to the high-level handler function.

The  example  shown  below  uses  four  global  (interrupt  temporary)  registers  for 
its  queue  building  processes.  Because  interrupts  are  disabled  during  this  entire  part  of 
the  process,  handlers  for  other  interrupts  can  use  these  same  registers. 

After the first byte has been stored in the static buffer, the handler must
determine if the queue is empty or if it already contains one or more entries.
If empty, the handler can immediately invoke a routine to process the entry. If
the queue contains one or more entries, the current entry is linked into the
queue. The code is shown below.


; UART receive interrupt handler (intr0)
;
        .reg    irqh, gr80          ; queue head pointer
        .reg    irqt, gr81          ; queue tail pointer
        .data
entry:
        .word   0, 0, receive       ; entry descriptor
        .block  256                 ; and data block
;
intr0:
        const   it1, entry+4        ; address of entry
        consth  it1, entry+4
        load    0, 0, it0, it1      ; get count
        add     it3, it1, 8         ; address of data
        add     it3, it3, it0       ; add count
        add     it0, it0, 4         ; increment count
        store   0, 0, it0, it1      ; count->entry+4
        const   it2, uart_rcv       ; UART data address
        consth  it2, uart_rcv
        load    0, 1, it2, it2      ; get data from UART
        store   0, 1, it2, it3      ; save in buffer
        cpeq    it0, it0, 4         ; first byte?
        jmpt    it0, startup        ; yes, start daemon
        nop
        iret                        ; no, return
;
startup:                            ; go daemon if not already running
        cpeq    it2, irqh, 0        ; is queue empty
        jmpf    it2, add            ; no, link this entry
        sub     it1, it1, 4         ; point to entry
        jmp     daemon              ; yes, go daemon
        add     irqh, it1, 0        ; init queue header
add:
        store   0, 0, it1, irqt     ; tail->entry
        add     irqt, it1, 0        ; entry->tail
        iret                        ; return
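
The linking decision made by the handler can be modeled in Python. This is a hedged sketch rather than a translation of the listing: the class and method names are hypothetical, `head` and `tail` play the roles of irqh and irqt, and the count is kept in bytes for simplicity.

```python
class Entry:
    """Queue entry: forward link, byte count, handler, and data block."""

    def __init__(self, handler):
        self.next = None
        self.count = 0
        self.handler = handler
        self.data = bytearray()


class IrqQueue:
    def __init__(self):
        self.head = None   # role of irqh
        self.tail = None   # role of irqt

    def receive_byte(self, entry, byte):
        """Store one byte; return True when the Dispatcher must be started."""
        entry.data.append(byte)
        entry.count += 1
        if entry.count != 1:
            return False           # not the first byte: nothing to link
        if self.head is None:      # queue empty: entry becomes the head
            self.head = self.tail = entry
            return True
        self.tail.next = entry     # otherwise link behind the current tail
        self.tail = entry
        return False
```

Only the first byte of an entry touches the queue pointers; later bytes just extend the data block, which keeps the Freeze mode path short.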


When  interrupts  occur  for  the  second  and  succeeding  bytes,  they  are  stored  in 
the  local  data  block,  following  the  static  descriptor  entry. 

After each byte has been stored, the handler can immediately return because a
routine has been invoked to process the entire queue. In UNIX systems, this
routine is often called a daemon. Once invoked, it continues to process entries
until the queue is empty, at which point it terminates its own execution. The
title Dispatcher shall be used to describe the routine invoked to process queue
entries (see Figure 4-6). A dispatcher routine may operate in User mode; in such
a case its operation is very similar to a signal handler (described in section
4.4).


[Figure 4-6. An Interrupt Queuing Approach. Labels in the figure: add queue
entry, Dispatcher, interrupt queue processing, IRET. Legend: Freeze mode code.]


The queue processing Dispatcher for this example must run with interrupts
enabled; otherwise, new data bytes could not be received from the UART, and
other interrupt-driven processes could not execute. Before interrupts are
enabled, a number of processor registers must be saved, as indicated earlier.
Nine kernel temporary registers are allocated for this purpose (kt3-kt11).
Because the Dispatcher is used to process all queued interrupts, it will not be
necessary to push these temporary registers onto the memory stack. The example
queue processing code is shown below.

; queue processing Dispatcher
;
        .equ    DISABLE, (PD | PI | SM | DI | DA)
        .equ    Kmode, (PD | PI | SM | IM)
        .equ    FREEZE, (DISABLE | FZ)
;
Dispatcher:
        mfsr    kt3, PC0            ; save PC0
        mfsr    kt4, PC1            ; save PC1
        mfsr    kt5, PC2            ; save PC2
        mfsr    kt6, CHA            ; save CHA
        mfsr    kt7, CHD            ; save CHD
        mfsr    kt8, CHC            ; save CHC
        mfsr    kt9, ALU            ; save ALU
        mfsr    kt10, OPS           ; save OPS
        mtsrim  CPS, DISABLE        ; remove Freeze mode
        mtsrim  CHC, 0              ; clear CHC
        add     irqt, irqh, 0       ; set tail = head
loop:
        mtsrim  CPS, Kmode          ; enable interrupts
        add     kt11, irqh, 8       ; point to handler
        load    0, 0, kt11, kt11    ; get address
        calli   kt11, kt11          ; call handler
        nop
        mtsrim  CPS, DISABLE        ; disable interrupts
        cpeq    kt11, irqt, irqh    ; queue empty?
        jmpt    kt11, finish        ; yes, wrapup
        nop
        load    0, 0, kt11, irqh    ; no, get next entry
        jmp     loop                ; and loop back
        add     irqh, kt11, 0       ; with head<-next
finish:
        mtsrim  cps, FREEZE         ; enable freeze mode
        const   irqh, 0             ; make queue empty
        mtsr    PC0, kt3            ; restore PC0
        mtsr    PC1, kt4            ; restore PC1
        mtsr    PC2, kt5            ; restore PC2
        mtsr    CHA, kt6            ; restore CHA
        mtsr    CHD, kt7            ; restore CHD
        mtsr    CHC, kt8            ; restore CHC
        mtsr    ALU, kt9            ; restore ALU
        mtsr    OPS, kt10           ; restore OPS
        iret                        ; terminate execution

Note that the example code does not save the Indirect Pointers (IPA-IPC) or the
Q register. If any of the individual high-level handlers will disturb the
contents of these registers, they must also be saved. If high-level handlers are
written carefully, it will not be necessary to save them.

The  queue  processor  is  responsible  for  removing  entries  from  the  queue  and 
calling  the  handler  associated  with  each  entry.  In  the  above  example,  a  pointer  to  the 
high  level  handler  is  contained  in  the  third  word  of  the  entry  descriptor  (in  this  case, 
receive). 

The handler is called after Freeze mode has been disabled and interrupts are
enabled. When the handler receives control, the IRQH register points to the
queue entry.

The high-level handler is responsible for removing the data associated with a
queue entry, and it must do this with interrupts disabled; however, interrupts
need only be disabled while the data is being removed and when the queue entry
data count is reset to zero. Any other portions of the handler not relevant to
these tasks can run with interrupts enabled.

After the handler has disposed of the data, it returns control to the
Dispatcher, which disables interrupts, enables Freeze mode, and attempts to
process the next entry in the queue. If no entries remain, it restores the saved
registers from kernel temporary registers kt3-kt10, and executes an IRET
instruction to return control to the interrupted task.
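
The control flow of the Dispatcher loop can be sketched in Python. This is an illustrative model only: `Node`, `Queue`, `link`, and `dispatch` are hypothetical names, and a handler that calls `link` stands in for an interrupt that queues a new entry while the handler is running.

```python
class Node:
    def __init__(self, handler):
        self.next = None
        self.handler = handler


class Queue:
    def __init__(self, first):
        self.head = first
        self.tail = first          # the Dispatcher starts with tail = head

    def link(self, node):
        """What a later interrupt does: append behind the current tail."""
        self.tail.next = node
        self.tail = node


def dispatch(queue):
    """Call each entry's handler in order until head catches up with tail."""
    while True:
        queue.head.handler(queue)       # runs with interrupts enabled
        if queue.head is queue.tail:    # head equals tail: queue drained
            break
        queue.head = queue.head.next    # advance to the next entry
    queue.head = None                   # mark the queue empty
```

Because the emptiness test is made after each handler returns, entries queued during a handler's execution are still picked up before the Dispatcher exits.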

In  cases  where  a  transaction  with  an  external  device  takes  a  long  time,  compared 
with  the  execution  time  of  the  high-level  handler,  the  data  is  moved  in  chunks. 
An  execution  profile  of  this  process  might  include  the  following  threads. 


[Figure 4-7. Queued Interrupt Execution Flow. Timeline with one row each for
User process, Interrupt-1, Interrupt-2, Process daemon, High-level Handler-1,
and High-level Handler-2. Annotated events: User process is interrupted;
Handler-1 is interrupted; Handler-2 is interrupted; Handler-1 completes; Process
daemon returns; User process resumes execution.]



■  Interrupt  function  stores  several  bytes  of  data  into  the  data  block. 

■  Process  Dispatcher  executes  the  high-level  handler,  which  empties  the  bytes, 
zeros  the  count  in  the  queue  entry. 

■  Another handler might execute for another active interrupt task.

■  Interrupt  function  creates  new  queue  entry  for  the  next  series  of  received  data 
bytes. 

■  High-level  handler  gets  called  to  remove  the  bytes  after  the  process  Dispatcher 
has  finished  with  the  current  queue  entry. 

Figure 4-7 illustrates this process. The occurrence of Interrupt-1 causes the
ongoing User process to be interrupted, and initiates execution of its interrupt
handler. The process builds the first queue entry and initiates execution of the
Process Dispatcher. The Dispatcher passes control to High-level Handler-1, which
begins execution.

This handler is interrupted by Interrupt-2 and Interrupt-1 events, and continues
to execute between these interruptions. When Handler-1 completes, it returns
control to the Process Dispatcher, which selects the next queue entry and turns
control over to high-level Handler-2.

During this execution, one more Interrupt-1 event occurs, which results in the
creation of another queue entry. This entry is processed when high-level
Handler-2 finishes its execution and the Process Dispatcher again receives
control.

High-level Handler-1 processes the remaining data and returns control to the
Process Dispatcher which, upon finding no more queue entries, returns to the
interrupted user process.

Each execution of the interrupt processes Interrupt-1 or Interrupt-2, as well as
the Process Dispatcher and high-level Handler-1 and high-level Handler-2 code
segments, is quite short; however, with the short execution approach, individual
interrupt priorities are not taken into account. If priority handling of
interrupts is important, a different approach is needed. For example, entries
could be linked into a single queue, with their position in the queue determined
by their priority. In this case, more sophisticated queue handling procedures
would have to be implemented; however, a given high-level handler would still
execute to completion before another handler is given control.

Handling fully nested, priority-oriented interrupts, that is, allowing a higher
priority interrupt to preempt the execution of a lower priority handler,
requires an interrupt stack (possibly with the support of an interrupt context
cache). It is questionable whether the responsiveness of the nested interrupt
technique would outweigh the increased overhead of saving and restoring many
sets of registers.

In the approach shown in the previous code only nine global registers are
required. These serve for all interrupt handlers in the system. During the
execution of the Freeze-mode interrupt handler only four interrupt temporary
registers are used (it0-it3).

4.3.13  Making  Timer  Interrupts  Synchronous 

The 29K on-chip timer can be configured to generate an interrupt when the
Timer Counter register (TMC) decrements to zero; more accurately, when the
24-bit TCV field of the Timer Counter register reaches zero. The TCV field is
decremented with each processor cycle; when it reaches zero, it is loaded with
the Timer Reload Value field (TRV) in the Timer Reload register (TMR).
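
The count-down and reload behavior can be pictured with a small Python model. This is a sketch under stated assumptions, not a cycle-accurate description of the hardware: the class name is hypothetical, the reload value is treated as a 24-bit quantity, and the reload is assumed to take effect in the same step in which the count reaches zero.

```python
MASK_24 = 0xFFFFFF   # width of the TCV field


class Timer:
    """Toy model of a 24-bit down-counter with automatic reload."""

    def __init__(self, reload_value):
        self.trv = reload_value & MASK_24   # Timer Reload Value
        self.tcv = self.trv                 # Timer Count Value

    def tick(self):
        """Advance one processor cycle; return True when the count hits zero."""
        self.tcv = (self.tcv - 1) & MASK_24
        if self.tcv == 0:
            self.tcv = self.trv             # reload from TRV
            return True
        return False
```

With a reload value of 3, the model reports a timer event on every third tick.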

When the Interrupt Enable bit (IE) of the TMR register is set and the TCV
reaches zero, the processor will take a timer interrupt unless the DA bit is set
in the Current Processor Status register (CPS). Two-bus and microcontroller
members of the 29K family can additionally disable timer interrupts by setting
the TD bit in the CPS register. Timer interrupts are not disabled by setting the
DI bit in the CPS. This means timer interrupts cannot simply be disabled along
with other external asynchronous interrupts by setting DI. Note that the
TRAP[1,0] asynchronous inputs are also not disabled by setting the DI bit. For
this reason, the use of the TRAP[1,0] pins requires complex software support. It
is best to avoid the use of these input pins.

It is often desirable to disable timer interrupts during critical code stages,
because timer interrupts often lead to such tasks as context switching. However,
timer interrupts may be required to support a real-time clock, and to maintain
accuracy, a timer interrupt cannot be missed. The timer interrupt must be taken,
but processing the event can be postponed until later, when it can be dealt
with. To do this efficiently, the Freeze mode interrupt handler for the timer
should set register ast to true. This register is a kernel space support
register chosen from the range ks0-ks15 (gr80-gr95). It indicates an
Asynchronous Software Trap (AST) is ready for processing. The ast register can
be quickly tested with an ASSERT type instruction, as shown below:


        mfsr    it0, ops            ; get OPS register, DA already clear
        andn    it0, it0, DI        ; clear DI bit
        mtsr    ops, it0            ; enable interrupts
        asneq   V_AST, ast, 0       ; trap if ast != 0, timer 'event'
        iret                        ; otherwise iret


Clearing  the  DI  bit  reenables  asynchronous  interrupts  (with  the  exception  that 
TRAP[1,0]  are  already  active);  but  we  must  check  to  see  if  an  AST  is  pending  (timer 
event).  The  high  level  timer  processing  is  performed  before  the  IRET  instruction  is 
executed,  via  trapware  supporting  the  V_AST  trap. 
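
The deferral pattern can be summarized in a short Python model. This is an illustrative sketch only: the class and method names are hypothetical, and counting ticks inside the Freeze mode handler is an assumption about how the real-time clock might be kept.

```python
class Kernel:
    def __init__(self):
        self.ast = False          # role of the ast support register
        self.ticks = 0            # real-time clock count (assumed)
        self.deferred_runs = 0    # number of high-level timer passes

    def timer_interrupt(self):
        """Freeze mode handler: record the tick and flag a pending AST."""
        self.ticks += 1
        self.ast = True

    def leave_critical_section(self):
        """Model of the ASNEQ test made just before IRET."""
        if self.ast:              # the assert would trap to V_AST here
            self.ast = False
            self.deferred_runs += 1
```

No tick is lost while the critical section runs, yet the expensive processing happens once, at the point where the flag is tested.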


4.4  USER-MODE  INTERRUPT  HANDLERS 

Many  present  day  operating  systems  allow  interrupt  handlers  to  be  written  in 
high-order  languages.  User  mode  routines  for  29K  Processor  Family  based  systems 
are  no  different.  When  providing  this  facility,  the  operating  system  designer  must  be 
aware  of  the  following  concerns. 

■  User mode programs are often written by programmers who lack specific
knowledge of the operating system and its allocation of global registers.

■  The  User  mode  handler,  when  written  in  a  high-level  language,  such  as  C,  will 
require  access  to  the  local  register  stack,  as  well  as  global  registers  defined  for 
its  management. 

A good approach for addressing these concerns is to perform all necessary
register saving, with interrupts disabled, while in Supervisor mode; remove the
cause of the interrupt, then enable interrupts and enter User mode to execute
the user’s interrupt handler code. This allows interrupt (signal) handlers to be
compatible with AMD’s Host Interface (HIF) v2.0 Specification (see section 2.2),
which includes the definition of operating system services. These services
install and invoke user-supplied interrupt handlers for floating-point
exceptions and keyboard interrupt (SIGFPE and SIGINT) events. It also allows the
operating system to perform its own register preservation and restoration
processes, without burdening the user with technical operating system details.
Complete listings of the code contained in this section are provided in Appendix
B and also by AMD in their library source code. Users who intend to modify any
of this code should bear in mind that the SPILL, FILL, setjmp, longjmp, and
Signal Trampoline code are highly interdependent.

The code uses an expanded interface definition that uses a set of global
registers to hold the important local register stack support values emitted by
compiler generated code in User mode programs. The registers defined for this
environment are shown in Table 4-2, and were discussed in detail in section 3.3
(page 117).


Table 4-2. Expanded Register Usage

Names   Registers   Usage Description
tav     gr121       Trap Argument Vector
tpc     gr122       Trap Return Pointer
lrp     gr123       Large Return Pointer
slp     gr124       Static Link Pointer
msp     gr125       Memory Stack Pointer
rab     gr126       Register Allocate Bound
rfb     gr127       Register Free Bound


In order to prepare for execution of a User mode handler, the HIF specification
indicates that the Supervisor mode portion of the handler must save important
registers in the user’s memory stack, as shown in Figure 4-8. In the figure, the
stack pointer (msp) is shown decremented by 48 bytes (12 registers times 4 bytes
each), and positioned to point to the saved value of register tav.

Other registers may need to be saved to allow complete freedom in executing
29K processor instructions (such as multiply or divide trap routines) in the
User mode handler code. Candidates for saving are the Indirect Pointers
(IPA-IPC), the Q register, the stack frame pointer (lr1), and the local register
stack bounds in rfb. In addition, because high-level languages use many of the
global registers as temporaries, these (gr96-gr124) may also have to be saved.

4.4.1  Supervisor  mode  Code 

When an interrupt occurs, the supervisor portion of the interrupt handler is
executed. This code is responsible for saving important processor registers, as
shown in Figure 4-8. The assembler macro instructions used earlier in this
chapter (push, pop, pushsr, and popsr), and described in detail in section 3.3.1
(page 119), are used in the following code examples to aid in pushing special
registers onto the memory stack.

[Figure 4-8. Saved Registers. The figure shows the user’s memory stack with
higher addresses at the top and lower addresses at the bottom. Register msp
points to the last register saved by the Supervisor mode portion of the handler
when control is given to the User mode code.]

The code to save the registers is executed in Supervisor mode, with Freeze mode
enabled, as indicated in section 4.3.7. This ensures that a higher priority
interrupt does not disrupt this critical section of code. The code is shown
below.


; supervisor portion of interrupt handler
;
sigint:
        jmp     interrupt
        const   it0, 2              ; SIGINT
;
sigfpe:
        const   it0, 8              ; SIGFPE
interrupt:
        sub     msp, msp, 4
        store   0, 0, it0, msp      ; save signal number
        sub     msp, msp, 4
        store   0, 0, gr1, msp      ; push gr1
        sub     msp, msp, 4
        store   0, 0, rab, msp      ; push rab
        const   it0, 512
        sub     rab, rfb, it0       ; set rab = rfb-512
;
        pushsr  msp, it0, PC0
        pushsr  msp, it0, PC1
        pushsr  msp, it0, PC2
        pushsr  msp, it0, CHA
        pushsr  msp, it0, CHD
        pushsr  msp, it0, CHC
        pushsr  msp, it0, ALU
        pushsr  msp, it0, OPS
        sub     msp, msp, 4
        store   0, 0, tav, msp      ; push tav
trampoline:

At this point in the code, with all of the critical registers saved, the memory stack
will appear as shown in Figure 4-8. When the User mode interrupt handler is
complete, these registers will be restored.
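
The push order in the listing above fixes where each saved word sits relative to msp. The following C sketch records that layout as a structure, lowest address first. The type and function names are hypothetical and are not part of the HIF specification; the offsets follow only from the order of the pushes in the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the twelve words pushed by the supervisor
 * code, lowest address first. msp points at `tav` on completion. */
typedef struct saved_frame {
    uint32_t tav;     /* pushed last, so it sits at msp + 0        */
    uint32_t ops;
    uint32_t alu;
    uint32_t chc;
    uint32_t chd;
    uint32_t cha;
    uint32_t pc2;
    uint32_t pc1;
    uint32_t pc0;
    uint32_t rab;     /* interrupted value of rab                  */
    uint32_t gr1;     /* register stack pointer (rsp)              */
    uint32_t signum;  /* pushed first, so it has the highest address */
} saved_frame;

/* Byte offset from msp of the saved rab, implied by the push order. */
static size_t saved_rab_offset(void)
{
    return offsetof(saved_frame, rab);
}

/* Byte offset from msp of the saved signal number. */
static size_t saved_signum_offset(void)
{
    return offsetof(saved_frame, signum);
}
```

Note that these offsets apply at the trampoline label only; any words pushed later by User mode code move msp further down.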

Special provisions were made in the code above in anticipation of the following
situation: If a FILL operation is interrupted, and the trampoline code has not yet
realigned the rab register to rfb-WindowSize, another interrupt occurring at that
point could again activate the trampoline code. This interrupt could cause the
trampoline code to assume that a FILL operation was in progress, thereby causing it to
"reposition" the value in PC1 to recommence the (assumed) FILL operation. For this
reason, rab is set to rfb-512 in the Supervisor mode code, while interrupts are still
disabled.


; Now come off freeze, and go to user-mode code.
; ensure load/store does not restart
;
trampoline:
        mtsrim  chc,0                   ;ensure load/store
                                        ; does not restart
        const   it1,RegSigHand
        consth  it1,RegSigHand
        load    0,0,it1,it1
        cpeq    it0,it1,0
        jmpt    it0,SigDfl              ;jump if no handler(s)
        add     it0,it1,4
        mtsr    pc1,it1
        mtsr    pc0,it0
        iret


Two types of interrupts are handled by this code: keyboard interrupts and
floating-point exceptions. It is assumed that the interrupt vectors were previously set to
vector to either sigint or sigfpe, depending on the type of interrupt. Interrupt
temporary (it0) is used to contain the type of interrupt (signal), when entering the common
code at label interrupt.

Once the memory stack is set up as indicated, the User mode portion of the
handler (beginning at label sigcode) is placed into execution by loading Program
Counters (PC0 and PC1) with the address of the handler. Then, while still in Freeze mode
with interrupts disabled, an IRET instruction is executed to begin execution of the
handler.

The HIF specification indicates that User mode signal handlers must call one of
the specified signal return services to return control to the user's code at the
appropriate point. When one of these services (sigret, sigrep, sigdfl, or sigskp) is called via an
ASSERT instruction, msp will point to the same location shown in Figure 4-8, so the
supervisor portion of the handler can properly restore the interrupted task's
environment.

The following code fragment illustrates how one of the return services restores
all of the registers. It is invoked by the HIF Service Trap (69) with interrupts disabled
and Freeze mode enabled, as is the case with any interrupt or trap.

; Signal return service, restore registers
;
sigret:
                                ;assume msp points to tav
        load    0,0,tav,msp     ;restore tav
        add     msp,msp,4
        popsr   OPS,it0,msp     ;pop specials
        popsr   ALU,it0,msp
        popsr   CHC,it0,msp
        popsr   CHD,it0,msp
        popsr   CHA,it0,msp
        popsr   PC2,it0,msp
        popsr   PC1,it0,msp
        popsr   PC0,it0,msp
        load    0,0,rab,msp     ;pop rab
        add     msp,msp,4
        load    0,0,it0,msp     ;pop rsp
        add     gr1,it0,0
        add     msp,msp,8       ;discount signal
        iret                    ; number

As  indicated  in  the  HIF  Specification,  User  mode  interrupt  handlers  must  save  a 
number  of  additional  registers,  to  prepare  for  executing  high-level  language  code. 
The  following  section  discusses  some  of  the  necessary  preparations. 

4.4.2  Register  Stack  Operation 

The 29K Processor Family contains 128 general registers that can be configured
as a register stack. In this case, global register gr1 is used to point to the first register
in this group that belongs to the current process. This first register is addressed as lr0
(local register 0).

Several additional global registers provide other information describing the
register stack bounds. These are all shown in Figure 4-9, which illustrates the
implementation of the local register file as a shadow copy of the memory-based register
stack cache.

The rab, rsp, lr1, and rfb registers (shown in Figure 4-9) contain the bounds of
the current memory stack cache in the form of addresses.

The rsp (register stack pointer) shown in Figure 4-9 is assigned to global
register gr1, whose low-order 9 bits (bits 0 and 1 are not used for register addressing) are
used to address the local register file whenever a local register number is encountered
in a 29K processor instruction. Therefore, local register lr2 is actually referenced by
the CPU by adding 8 (2 words) to the value held in register gr1.
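
The addressing rule above can be written out as a small C sketch. The function names are hypothetical. The wrap-around at 128 entries is an assumption based on the size of the local register file; it is not stated in the paragraph above.

```c
#include <assert.h>
#include <stdint.h>

/* Memory-stack address shadowed by local register lrN: gr1 + 4*N. */
static uint32_t local_reg_address(uint32_t gr1, unsigned n)
{
    assert(n < 128);
    return gr1 + 4u * n;
}

/* Index (0..127) into the 128-entry local register file for lrN.
 * Bits 8..2 of gr1 select the base entry; bits 1..0 are ignored.
 * The modulo-128 wrap is an assumption of this sketch. */
static unsigned local_reg_index(uint32_t gr1, unsigned n)
{
    assert(n < 128);
    unsigned base = (unsigned)(gr1 >> 2) & 0x7Fu;
    return (base + n) & 0x7Fu;
}
```

With gr1 holding 0x1000, lr2 corresponds to address 0x1008, that is, 8 bytes (2 words) above the value in gr1.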

Other important details of the register stack and local register file are discussed
in Chapter 5 (Operating System Issues).


[Figure 4-9. Register and Stack Cache]


The  important  concern  in  writing  interrupt  handlers  that  use  local  registers  is 
that  the  local  register  file  bounds  and  contents  at  the  time  of  an  interrupt  reflect  the 
current  state  of  the  interrupted  program. 

For example, looking at Figure 4-9, when an application calls a function, the
activation record for the new function is allocated immediately below the current rsp,
occupying part of the register file whose corresponding section is indicated as
"unused." If the new activation record is larger than the currently unused space (i.e., rsp
is decremented to point below the current value in the rab register), the stack is said to
have overflowed. When this overflow occurs, some of the existing registers in the
local register file must be "spilled" to make room for the new activation record. The
number of registers involved in the "spill" must be sufficient to allow the entire new
activation record to be wholly contained in the local register stack.

A  similar  situation  occurs  when  a  called  function  is  about  to  return  to  its  caller 
and  the  entire  activation  record  of  the  caller  is  not  currently  contained  in  the  local 
register  file.  In  this  case,  the  portion  of  the  caller’s  activation  record  not  located  in  the 
register  file  must  be  “filled”  from  the  memory  stack  cache.  Management  of  the  local 
register  file  requires  the  use  of  User  mode  functions  that  perform  SPILL  and  FILL 
operations,  in  concert  with  a  Supervisor  mode  trap  handler  when  the  SPILL  or  FILL 
operation  is  needed. 
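
The two conditions described above reduce to simple unsigned comparisons on the stack support registers. The following C sketch states them directly. The function names are hypothetical, and the registers are modelled as plain 32-bit addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Function entry: the prologue lowers rsp (gr1) by the size of the
 * new activation record, given here in bytes. */
static uint32_t lower_rsp(uint32_t gr1, uint32_t frame_bytes)
{
    return gr1 - frame_bytes;
}

/* A SPILL is needed when the lowered rsp lies below rab. */
static bool spill_required(uint32_t gr1, uint32_t rab)
{
    return gr1 < rab;
}

/* A FILL is needed before return when the caller's frame pointer
 * (lr1) lies above rfb, so part of its record is only in memory. */
static bool fill_required(uint32_t lr1, uint32_t rfb)
{
    return lr1 > rfb;
}
```

For example, with rab at 0x2000 and gr1 at 0x2020, allocating a 0x40-byte record lowers gr1 to 0x1FE0 and a SPILL is required.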




4.4.3  SPILL  and  FILL  Trampoline 

High-level language compilers automatically generate code that tests for a
required SPILL upon entry to a called function, and for a required FILL operation just
before a called function exits. In either case, the SPILL or FILL is initiated by an
ASSERT instruction whose assertion fails. This causes the SPILL or FILL trap
handler to begin its execution in Supervisor mode with special registers frozen.

The Supervisor mode code must initiate execution of the appropriate handler by
leaving Supervisor mode, and doing its processing in User mode. Several benefits are
obtained from operating SPILL or FILL handlers in User mode. First, the overhead
of leaving Freeze mode is avoided; handlers must leave Freeze mode because they
require the use of load- and store-multiple instructions. Additionally, FILL and
SPILL handlers may require several machine cycles to complete; if they were to
operate with DA set (interrupts disabled), a potential interrupt latency problem would
result.

The following entry points, SpillTrap and FillTrap, are directly invoked by
their corresponding hardware vectors when the associated ASSERT instruction
is executed. The operands SpillAddrReg and FillAddrReg are aliased to kernel
static registers (two of ks0-ks15), which hold the addresses of the User mode SPILL and
FILL handlers.

Because the processor's execution jumps from Supervisor mode to User mode
in this fashion, the SpillTrap and FillTrap code is called a trampoline. The
SpillTrap and FillTrap trampoline code is shown below.


SpillTrap:
;
; Preserve the return address in the
; designated register
        mfsr    tpc,PC1
;
; Fixup PC0 and PC1 to point at the user
; designated spill handler
        mtsr    PC1,SpillAddrReg
        add     tav,SpillAddrReg,4
        mtsr    PC0,tav
;
; And return to that handler
        iret

FillTrap:
;
; Preserve the return address in the
; designated register
        mfsr    tpc,PC1
;
; Fixup PC0 and PC1 to point at the user
; designated fill handler
        mtsr    PC1,FillAddrReg
        add     tav,FillAddrReg,4
        mtsr    PC0,tav
;
; And return to that handler
        iret

The SpillTrap and FillTrap routines both turn control over to the User mode
sections of their respective handlers by modifying the addresses held in the
processor's frozen PC0 and PC1 registers. This happens after the current address in PC1 has
been temporarily saved in register tpc (gr122).

When the IRET instruction is executed, the processor reenters User mode, with
the same interrupt enable state as when the trap occurred, and begins execution at the
address loaded into PC1.

4.4.4  SPILL  Handler 

The FILL and SPILL handlers are executed in User mode to ensure the greatest
processor performance for these operations. The handlers are invoked by the
Supervisor mode trap handler, usually with interrupts enabled. This permits SPILL and
FILL operations to be interrupted, and to use load- and store-multiple operations to
accomplish their task.

An  example  User  mode  SPILL  handler  is  shown  below. 


; spill handler
;
; spill registers from [*gr1..*rab)
; and move rab down to where gr1 points.
;
; On entry:     rfb - rab = windowsize,
;               gr1 < rab.
;
; Near the end: rfb - rab > windowsize,
;               gr1 == rab
; On exit:      rfb - rab = windowsize,
;               gr1 == rab
;
        .global spill_handler
spill_handler:
        sub     tav,rab,gr1     ;bytes to spill
        srl     tav,tav,2       ;bytes to words
        sub     tav,tav,1       ;make zero based
        mtsr    CR,tav          ;set CR register
        sub     tav,rab,gr1
        sub     tav,rfb,tav     ;dec. rfb by tav
        add     rab,gr1,0       ;copy rsp into rab
        storem  0,0,lr0,tav     ;store lr0..lr(tav)
        jmpi    tpc             ;return...
        add     rfb,tav,0

In the above code, note that the condition for entry is that global register gr1 (rsp) has
already been decremented to a value less than the current value in rab. This lower
value is what signals the need to spill some registers. The order in which the
management registers are changed by the SPILL handler is very important, particularly if an
interrupt were to occur during the SPILL operation. In this case, register rab must be
changed before rfb.

The value in register rab is maintained for convenience and performance gain;
it is a cache of the rfb-WindowSize value. The rfb register is the anchor point for
local register file (cache) and memory resident register-stack crossover.
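
The register arithmetic performed by the SPILL handler can be followed in a short C sketch. This is a model of the address calculations only, not of the STOREM transfer itself. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Result of a SPILL: new cache bounds and the number of words stored. */
typedef struct spill_result {
    uint32_t rab;    /* new rab, equal to gr1 on exit      */
    uint32_t rfb;    /* new rfb, lowered by the same bytes */
    uint32_t words;  /* words written to the memory stack  */
} spill_result;

static spill_result spill(uint32_t gr1, uint32_t rab, uint32_t rfb)
{
    assert(gr1 < rab);             /* entry condition: rsp is below rab */
    uint32_t bytes = rab - gr1;    /* bytes to spill                    */
    spill_result r;
    r.words = bytes >> 2;          /* bytes to words                    */
    r.rab = gr1;                   /* rab is changed first ...          */
    r.rfb = rfb - bytes;           /* ... and rfb last, after the store */
    return r;
}
```

Because both bounds move down by the same number of bytes, rfb - rab equals the window size again on exit. Between the two updates the difference is temporarily larger than the window size, which is the "near the end" state noted in the listing's comments.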

4.4.5  FILL  Handler 

The FILL handler is similar to the SPILL handler, except that bytes are moved
from the memory stack to the local register file. This handler is initiated when the
value in lr1 is larger than the current value in the rfb register.

; fill registers from [*rfb..*lr1)
; and move rfb upto where lr1 points.
;
; On entry:     rfb - rab = windowsize,
;               lr1 > rfb
;
; Near the end: rfb - rab < windowsize,
;               lr1 == rab + windowsize
; On exit:      rfb - rab = windowsize,
;               lr1 == rfb
;
        .global fill_handler
fill_handler:
        const   tav,(0x80<<2)
        or      tav,tav,rfb     ;tav=[rfb]<<2
        mtsr    IPA,tav         ;ipa = [rfb]<<2
        sub     tav,lr1,rfb     ;tav = byte count
        add     rab,rab,tav     ;push up rab
        srl     tav,tav,2       ;word count
        sub     tav,tav,1       ;zero based
        mtsr    CR,tav          ;set CR register
        loadm   0,0,gr0,rfb     ;load registers
        jmpi    tpc             ;return...
        add     rfb,lr1,0       ;...pushing up rfb

In  the  case  of  a  fill  condition,  the  rfb  register  must  be  changed  only  after  the 
FILL  operation  is  complete;  however,  the  rab  register  is  modified  prior  to  execution 
of  the  LOADM  instruction.  That  is,  the  anchor  point  indicated  by  register  rfb  must  be 
updated  only  after  the  data  transfer  has  been  accomplished. 
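
The corresponding arithmetic for a FILL can be sketched in C in the same way as for a SPILL. Again this models only the address calculations, not the LOADM transfer, and the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Result of a FILL: new cache bounds and the number of words loaded. */
typedef struct fill_result {
    uint32_t rab;    /* new rab, raised before the transfer      */
    uint32_t rfb;    /* new rfb, equal to lr1 after the transfer */
    uint32_t words;  /* words read from the memory stack         */
} fill_result;

static fill_result fill(uint32_t lr1, uint32_t rab, uint32_t rfb)
{
    assert(lr1 > rfb);             /* entry condition                  */
    uint32_t bytes = lr1 - rfb;    /* byte count                       */
    fill_result r;
    r.rab = rab + bytes;           /* rab is pushed up before the load */
    r.words = bytes >> 2;          /* word count                       */
    r.rfb = lr1;                   /* rfb moves only once data is in   */
    return r;
}
```

Between the two updates rfb - rab is smaller than the window size. An interrupt handler can use that difference to recognize an interrupted FILL.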
4.4.6 Register File Inconsistencies

The discussion of SPILL and FILL User mode handlers is important when
writing interrupt routines because a SPILL or FILL may be incomplete at the time
the interrupt occurs. Depending on whether a SPILL or FILL is in progress, the
interrupt handler must prepare the register stack support registers before attempting to
pass control to a User mode handler that makes use of the local register file.

Figure 4-10 illustrates a global view of the register stack, as it might appear both
in the local registers and in the memory stack cache at the time of an interrupt. In this
case, the interrupt occurred during execution of a SPILL operation, probably during
execution of the STOREM instruction. Therefore, the address in register gr1 has
already been decremented in anticipation that the proposed activation record will fit in
the local registers. In addition, because a SPILL operation was necessary, the rab
register has also been set equal to gr1 in the SPILL handler.

[Figure 4-10. Stack Upon Interrupt]

The interrupt handler must recognize this condition because it must prepare the
register stack for entry into a C language user interrupt function. This will require the
stack management registers to be consistent. Repairing stack inconsistencies
depends on the interrupt handler being able to recognize each unique situation where
such an inconsistency could occur. In the case of the C language environment, there
are three situations that must be detected.

■  The interrupt occurred when a SPILL was in progress, in which case the
distance between the values in the rfb and rab registers exceeds the size of the local
register file (referred to as the WindowSize).

■  The interrupt occurred when a FILL operation was in progress, in which case
the distance between the values in the rfb and rab registers is less than the size of
the local register file.

■  The interrupt occurred during a far-longjmp operation (see Figure 4-12a), in
which case the value (gr1 + 8), which is the address of local register lr2 on the
register memory stack, is greater than the value in the rfb register.
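
The three situations listed above can be summarized as a classification on the values of gr1, rab, and rfb. The following C sketch is one way to express it. The enumeration and function names are hypothetical, and the order of the tests is a choice made for this sketch (the far-longjmp test is applied only when rfb - rab is smaller than the window size).

```c
#include <assert.h>
#include <stdint.h>

typedef enum stack_state {
    STACK_NORMAL,       /* management registers are consistent */
    STACK_SPILL,        /* interrupted during a SPILL          */
    STACK_FILL,         /* interrupted during a FILL           */
    STACK_FAR_LONGJMP   /* interrupted during a far longjmp    */
} stack_state;

/* `rab` is the interrupted (saved) value; `window` is the size of
 * the local register file in bytes (WindowSize). */
static stack_state classify(uint32_t gr1, uint32_t rab,
                            uint32_t rfb, uint32_t window)
{
    uint32_t span = rfb - rab;
    if (span > window)
        return STACK_SPILL;
    if (span == window)
        return STACK_NORMAL;
    /* span < window: either a far longjmp or a FILL in progress */
    if (gr1 + 8 > rfb)
        return STACK_FAR_LONGJMP;
    return STACK_FILL;
}
```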

The following code fragment illustrates a method of recognizing these
inconsistent stack conditions.

The Supervisor mode portion of the interrupt handler has saved the important
processor registers as shown in Figure 4-8. Because the User mode portion of the
handler is intended to execute a C language function, additional registers will need to
be saved. The register stack support registers, indirect pointers (IPA-IPC), as well as
global registers (gr96-gr124) are pushed onto the memory stack just below the
signal context frame.


sigcode:
        push    msp,lr1         ;push R-stack
        push    msp,rfb         ; support
        push    msp,msp         ;M-stack support
        sub     msp,msp,3*4     ;Floating Point
;
        pushsr  msp,tav,IPA     ;User mode specials
        pushsr  msp,tav,IPB
        pushsr  msp,tav,IPC
        pushsr  msp,tav,Q
;
        sub     msp,msp,29*4    ;push gr96-gr124
        mtsrim  cr,29-1
        storem  0,0,gr96,msp


Additional space on the memory stack is allocated for floating point registers. If
the C language signal handler is to make use of floating point resources then the
necessary critical support registers should be saved. Further discussion of these and an
explanation of the format of the saved context information can be found in Chapter 5
(Operating System Issues). After the additional context status has been saved the
register stack condition can then be examined.

;Recognize inconsistent stack conditions
;
        const   gr96,WindowSize         ;get cache size
        consth  gr96,WindowSize
        load    0,0,gr96,gr96
        add     gr98,msp,SIGCTX_RAB
        load    0,0,gr98,gr98           ;interrupted rab
        sub     gr97,rfb,gr98
        cpgeu   gr97,gr97,gr96          ;rfb-rab >= WS
        jmpt    gr97,nfill              ;jmp if spill
        add     gr97,gr1,8              ;or normal stack
        cpgtu   gr97,gr97,rfb           ;longjmp test
        jmpt    gr97,nfill              ;yes, longjmp case
        nop                             ;jmp if gr1+8 > rfb

; Fixup registers to re-start FILL operation
;
        add     gr96,msp,SIGCTX_RAB+4
        push    gr96,rab                ;resave rab=rfb-512
        const   gr98,fill_handler+4
        consth  gr98,fill_handler+4
        push    gr96,gr98               ;resave PC0
        sub     gr98,gr98,4
        push    gr96,gr98               ;resave PC1
        const   gr98,0
        sub     gr96,gr96,3*4           ;point to CHC
        push    gr96,gr98               ;resave CHC=0


The variable WindowSize is initialized to the size of the local register stack, in
bytes, when the library signal function is first called. Referring to Figure 4-10, and to
the code fragment shown above, it is clear that the result of subtracting the saved rab
from rfb will be larger than the local register stack size. Therefore, the program will
handle the spill (and normal interrupt) cases by jumping to label nfill. The longjmp
case, once detected, is also sent to the nfill label, where the code discriminates
between the conditions.


;discriminate between SPILL, longjmp and
; normal interrupts
nfill:
        cpgtu   gr96,gr1,rfb            ;if gr1 > rfb
        jmpt    gr96,lower              ;then gr1 = rfb
        cpltu   gr96,gr1,rab            ;if gr1 < rab
        jmpt    gr96,raise              ;then gr1 = rab
        nop
sendsig:


In the interrupted FILL case, the saved rab value is overwritten with the
realigned rab value. The send-signal code (section 4.4.1) subtracted the WindowSize
from the value in rfb to determine the aligned rab value. This was done before issuing
an IRET to sigcode.

Essentially, this restores rab to where it pointed immediately before executing
the function call that caused the FILL operation. Note that this recomputation is also
valid for a normal case, where the management registers are consistent.

The two comparisons at label nfill determine which method, if any, should be
used to repair the value in register gr1. The method depends on whether a longjmp,
SPILL, or normal interrupt occurred. This is required to align gr1 to a valid cache
position where longjmp or SPILL is interrupted. The following code fragment
shows the code associated with the lower and raise labels.

; lower or raise value in gr1
lower:
        jmp     sendsig
        add     gr1,rfb,0       ;set gr1 = rfb
raise:
        jmp     sendsig
        add     gr1,rab,0       ;set gr1 = rab

According to the situation depicted in Figure 4-10, when a SPILL operation is
interrupted, code at the raise label is executed, after which execution resumes at the
label sendsig.
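
Taken together, the nfill comparisons and the lower and raise code clamp gr1 to the range bounded by rab and rfb. A minimal C sketch of that repair follows. The function name is hypothetical, and rab here is the realigned value (rfb less the window size).

```c
#include <assert.h>
#include <stdint.h>

/* Repair gr1 so that it lies within the current cache bounds. */
static uint32_t repair_gr1(uint32_t gr1, uint32_t rab, uint32_t rfb)
{
    if (gr1 > rfb)      /* longjmp case: "lower" gr1 to rfb   */
        return rfb;
    if (gr1 < rab)      /* SPILL case: "raise" gr1 to rab     */
        return rab;
    return gr1;         /* normal case: gr1 is already valid  */
}
```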

The code fragment titled "Fixup registers to re-start FILL operation", shown
above, is entered if the interrupt occurred during a FILL operation. If so, it is
necessary to change the saved values for the Program Counters, PC0 and PC1, and clear the
value saved in the CHC register. These registers are assumed to have been saved in
the order shown in Figure 4-8. This is required in addition to realigning the register
stack support register, rab.

The identifiers called SIGCTX_RAB and SIGCTX_SIG are defined as
numeric offsets (to be added) to the memory stack address held in register msp. Making
these changes will effectively restart the FILL operation from its beginning. This
code also falls into the code beginning at label nfill, but in the case of an interrupted
FILL operation, the value in register gr1 will not be adjusted.
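
The values written back into the saved context to restart a FILL can be summarized in a C sketch. The structure and function names are hypothetical. Only the four saved words that the fix-up code rewrites are modelled, and the handler address is passed in as a plain number.

```c
#include <assert.h>
#include <stdint.h>

/* The four saved words rewritten when a FILL must be restarted. */
typedef struct fill_restart {
    uint32_t rab;  /* realigned to rfb less the window size       */
    uint32_t pc0;  /* second instruction of the FILL handler      */
    uint32_t pc1;  /* first instruction of the FILL handler       */
    uint32_t chc;  /* cleared so no load/store-multiple restarts  */
} fill_restart;

static fill_restart restart_fill(uint32_t rfb, uint32_t window_bytes,
                                 uint32_t fill_handler)
{
    fill_restart c;
    c.rab = rfb - window_bytes;
    c.pc0 = fill_handler + 4;
    c.pc1 = fill_handler;
    c.chc = 0;
    return c;
}
```

When these values are later restored by one of the signal return services, execution resumes at the first instruction of the FILL handler with consistent stack bounds.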

4.4.7  Preparing  the  C  Environment 

After  stack  repairs  have  been  made  to  the  (possibly  inconsistent)  management 
registers,  it  is  necessary  to  prepare  for  C  language  interrupt  handler  code  execution. 
These  preparations  consist  mainly  of  setting  up  a  new  stack  frame  from  which  the 
user’s  handler  can  execute.  At  this  point  in  the  process,  the  register  stack  and  memory 
cache  appear  as  shown  in  Figure  4-11. 

The following code fragment picks up at the label sendsig, which is repeated for
clarity. The handler is almost ready to pass control to the user's C language handler
code, but first it must set up a stack frame that looks as though the user's function was
called in a normal fashion (rather than being invoked as part of an interrupt handler).
This is accomplished in the same way a normal C language function allocates its
stack frame upon entry.

[Figure 4-11. Stack After Fix-up]


; Create an activation record on the stack
; for our handler, so the user code will
; operate as though it has been "called"
;
        .equ    RALLOC,4*4      ;space for function
sendsig:
        sub     gr1,gr1,RALLOC
        asgeu   V_SPILL,gr1,rab
        add     lr1,rfb,0       ;set lr1 = rfb
        add     gr97,msp,SIGCTX_SIG
        load    0,0,lr2,gr97    ;restore sig number
        sub     gr97,lr2,1      ;get handler index
        sll     gr97,gr97,2     ;point to addresses
;Handler must not use HIF services other
;than the _sigret() type.
        const   gr96,SigEntry
        consth  gr96,SigEntry
        add     gr96,gr96,gr97
        load    0,0,gr96,gr96   ;registered handler
        cpeq    gr97,gr96,0
        jmpt    gr97,NoHandler
        nop
        calli   lr0,gr96        ;call C-level
        nop                     ; signal handler
NoHandler:
        jmp     _sigdfl
        nop

The user function called by the above code is assumed to be one that has been
passed to the signal library function to process either SIGINT or SIGFPE interrupts,
or both. The SigEntry label in the above code refers to a table of pointers. In the
example, one contains the address of a user signal handler for keyboard interrupts
(SIGINT) and the other points to the handler for floating-point exceptions (SIGFPE). A
pointer to the user handler for each of these is installed in the SigEntry table by the
signal library function.
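
The table lookup performed at sendsig can be sketched in C as follows. The names sig_table_offset, lookup_handler, demo_handler, and demo_table are hypothetical, and the table here is a stand-in for SigEntry. The byte offset assumes 4-byte table entries, as in the assembly code above.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*sig_handler)(int);

/* Byte offset into a table of 4-byte entries: (signal - 1) * 4. */
static unsigned sig_table_offset(int signum)
{
    assert(signum >= 1);
    return (unsigned)(signum - 1) << 2;
}

/* Return the registered handler, or NULL when none is installed,
 * in which case the default action is taken instead. */
static sig_handler lookup_handler(const sig_handler *table,
                                  size_t entries, int signum)
{
    if (signum < 1 || (size_t)(signum - 1) >= entries)
        return NULL;
    return table[signum - 1];
}

/* Example table with a handler installed for signal 2 (SIGINT) only. */
static void demo_handler(int signum) { (void)signum; }
static const sig_handler demo_table[8] = {
    NULL, demo_handler, NULL, NULL, NULL, NULL, NULL, NULL
};
```

With the signal numbers used in this chapter, SIGINT (2) selects the entry at byte offset 4 and SIGFPE (8) the entry at byte offset 28.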

4.4.8  Handling  Setjmp  and  Longjmp 

Although  not  strictly  related  to  interrupt  handling,  many  C  language  libraries 
contain  a  setjmp  routine  used  to  record  the  values  of  the  register  and  memory  stack 
support  registers,  and  an  additional  longjmp  routine  that  allows  a  program  to  jump  to 
a  consistent  environment  saved  by  a  previous  call  to  the  setjmp  routine. 

Figure 4-12 illustrates the location in the stack and memory cache to which the
saved information from a previously executed setjmp call might refer. The saved
information (stored in a special record specified in the call to setjmp) contains the
values of gr1, msp, lr0, and lr1, as they appear when the call to setjmp was made.

Interrupt handler code must make provisions for a User mode handler to call the
longjmp function from within the code. During the course of executing the longjmp,
the values stored in the marker record are loaded into their respective registers. The
processor is executing in User mode, with interrupts enabled, so this process might be
interrupted at any point. The interrupt handler code that recognizes stack
inconsistencies (presented earlier) handles this case by fixing up the management registers, to
establish a consistent stack. When the interrupt handler returns, the longjmp will be
properly completed.

Not all User mode signal handlers will have to contend with the complexities
introduced by setjmp and longjmp function calls. In this case, the code presented
earlier can be somewhat simplified; however, because the amount of code devoted to
this potential situation is very small, it is recommended that users provide the
additional checks and compensating code.

[Figure 4-12. Long-Jump to Setjmp. Panel (b): Long-Jump to a near Setjmp]

Chapter 5


Operating  System  Issues 


Because application programs make use of operating system services, the
overhead costs associated with typically requested services are of great interest. With
the performance levels offered by the best RISC implementations, these overhead
costs have become very low. However, the often increased complexity of RISC
operating systems has led to some confusion about the efficiency of operating
system implementations.

This  chapter  discusses  in  detail  the  various  forms  of  context  switching  which 
occur  between  operating  system  and  application  code.  This  particular  task  is  one  of 
the  more  complex  functions  supported  by  a  typical  operating  system.  Also  discussed 
are  general  issues  related  to  context  switching.  The  large  number  of  registers 
available  to  application  programs  may  initially  suggest  that  the  29K  is  not  ideal  at 
performing  application  context  switching.  However,  there  are  a  number  of 
optimizations  which,  when  applied,  greatly  reduce  context  save  and  restore  times 
[Mann  1992a]. 

The code examples shown make use of a number of macros for pushing and
popping special registers to an external memory stack. These macros were presented
in section 3.3.1, Assembly Programming.

Within this chapter, context information will be frequently stored and reloaded
from a per-task data region known as the Process Control Block (PCB). An operating
system register in the range of ks1-ks12 is assumed to point within the PCB stack.
The example code assumes that the relevant register known as pcb has already been
assigned the correct memory address value by operating system specific code. The
example code also uses constants of the form CTX_CHC. These are offsets from the
top of the PCB stack (lower address) to the relevant address containing the desired
register information (the CHC register in the example). When a memory stack is used
to save the context in place of the PCB data structure, the CTX_ offset constants may
still be used.

5.1  REGISTER  CONTEXT 

Part  of  the  increased  performance  of  the  29K  family  comes  from  using  128 
internal  registers  as  a  register  stack  cache.  The  cache  holds  the  top  of  the  run-time 
stack.  Each  procedure  obtains  its  necessary  register  allocation  by  claiming  a  region 
of  the  register  stack.  The  register  cache  does  not  have  to  be  flushed  (spilled)  until 
there  is  insufficient  unallocated  register  space.  This  happens  infrequently.  The 
register  stack  offers  greater  performance  benefits  over  a  data  memory  cache,  due  to 
register  cache  triple  porting  on-chip  (two  read  ports  and  one  write  port).  Note,  the 
Am29050  has  an  additional  write  port  which  can  be  used  to  simultaneously 
write-back  a  result  from  the  floating-point  unit.  Chapter  2  explains  in  detail  the 
procedure  calling  mechanism’s  use  of  the  cache. 

However,  when  a  context  switch  is  required  from  one  user  task  to  another  user 
task,  it  is  necessary  to  copy  all  internal  registers  currently  allocated  to  the  current  user 
task  to  a  data  memory  save  region.  This  makes  the  registers  available  for  use  by  the 
in-coming  task. 

In performing a context switch, a clear understanding is required of processor
register usage. The AMD C Language register usage convention (see section 2.1)
makes 33 of the 65 global registers (gr1, gr96-gr127) available for User data storage.
Global registers gr128-gr255, used to implement the local register stack, are also
used by the compiler generated code. (See section 3.3 (page 117) of Chapter 3,
Assembly Language Programming, for global register assignment.)

Processor  global  registers  gr64-gr95  are  not  accessed  by  C  generated  code. 
These  registers  are  normally  used  by  the  Supervisor  to  store  operating  system 
information  or  implement  interrupt  handler  temporary  working  space.  Particular 
Supervisor  implementations  may  store  data  in  registers  gr64-gr95.  This  data  is 
relevant  to  the  task  currently  executing,  and  includes  such  information  as  pointers  to 
memory  resident  data  structures  containing  system  support  information.  This  data 
may  also  have  to  be  copied-out  to  memory  when  a  task  switch  is  required. 

The C procedure calling convention specifies that global registers gr96-gr111
are used for return value passing. For a procedure returning a 32-bit integer, only
register gr96 is required to store return value information. The compiler generally
uses these global registers for temporary working space before the return value data
is determined. The compiler has nine more temporary registers in the gr112-gr127
range which can also be used for temporary data storage. Other registers in this range
are used to implement register stack support functions.

When more registers are required by a procedure for data storage, the local
register stack can be used. This reduces the need to use external data memory to store
procedure data.

The prologue of each procedure lowers the register stack pointer (gr1) by the
amount necessary to allocate space for a procedure’s in-coming and out-going
parameters. The prologue code is generated by the compiler, and can thus lower the
stack pointer by an additional amount to make temporary registers available to the
procedure. The compiler is more likely to do this when the “-O” optimization switch
is used and the procedure has an unusually large register requirement.

Each 29K processor reserves global register gr1 to implement a register stack
pointer, which points to the base of the current procedure register allocation
(activation record) (see Figure 5-1). Register gr1 points to the first local register
allocated to the procedure, known as lr0. Local register lr1, located in the register
cache at location [gr1]+4, is the second local register available to the procedure. The
C calling convention rules state that this register is reserved for pointing to the top of
the procedure activation record. The lr1 register, known as the frame pointer, points
to the first register above the register group allocated to the current procedure (see
Figure 5-2). The frame pointer is used during register stack filling (cache filling)
when it must be determined if the registers allocated to the current procedure are
located in the register stack and not spilled-out (flushed out) to external data
memory.

Figure 5-1. A Consistent Register Stack Cache
(Figure labels: memory stack, cache, rfb, rsp, rab, lower memory address.)

Figure 5-2. Current Procedure’s Activation Record

A  leaf  procedure  is  defined  as  one  that  does  not  call  any  other  procedure. 
Because  leaf  procedures  have  no  out-going  parameters  (data  passed  to  called 
functions),  they  do  not  have  to  lower  the  register  stack  pointer  and  create  an 
activation  record.  It  is  likely  they  have  sufficient  temporary  working  space  in  the  25 
global  registers  available  to  each  procedure.  Of  course,  when  one  procedure  calls 
another  it  must  assume  the  called  procedure  will  use  available  global  registers,  and 
thus  store  critical  data  to  local  register  locations  or  external  data  memory.  However,  a 
particularly  large  leaf  procedure  may  allocate  an  activation  record  to  gain  access  to 
additional local register storage. Leaf procedures which do this obtain a new lr1
register that need not be used to point to the top of the activation record (because
leaves do not call other procedures). In this case, a leaf procedure is free to use local
register lr1 as additional temporary storage.

It is interesting to note that some C compilers achieve a performance gain by
breaking a previously listed rule. That is, a calling procedure need not always assume
the called procedure will use all 25 global registers. If the called procedure is defined
before calls are made to it, the compiler can determine its register usage. This enables
the compiler to issue code that saves only the global registers affected by the callee,
rather than preserving all global registers which are in use at the time of the call.

5.2  SYNCHRONOUS  CONTEXT  SWITCH 

The discussion in the Register Context section is not a complete introduction to
the register stack mechanism required to support C procedures executing on a 29K
processor (see Chapter 2). However, the information is required to understand the
process of a synchronous context switch. In a synchronous context switch, the
currently executing user task voluntarily gives up the processor to enable another
task to start execution. This is normally done via a system call. Because of the C
calling rules, the procedure which makes the system call cannot itself be a leaf
function. This means that the lr1 value of the procedure making the system call
always contains a valid pointer to the top of the current activation record. If the
library routine implementing the system call does not lower the register stack (in
practical terms, it is a small leaf procedure), the current lr1 value is a valid pointer
to the top of the activation record.

At  first  glance  it  seems  the  large  number  of  internal  registers  must  result  in  an 
expensive  context  save  and  restore  time  penalty.  Further  study  shows  that  this  is  not 
the  case. 

Much  of  the  time  required  to  complete  a  context  switch  is  associated  with 
moving  data  between  external  memory  and  internal  registers.  However,  a  significant 
portion  of  the  time  is  associated  with  supervisor  overhead  activities. 

When  saving  the  context  of  the  current  process  all  the  registers  holding  data 
relevant  to  the  current  task  must  have  their  contents  copied  to  the  external  data 
memory  save  area. 

A 29K processor contains a number of special purpose registers. There are eight
user task accessible special registers, sr128-sr135, used to support certain
instruction type execution. Assuming the exiting task (the one that is being saved)
was written in C and the system call library code does not contain any explicit
move-to-special-register instructions, there is no need to save these registers: any
instructions requiring the support of special registers will have completed by the
time of the context switch system call. The AMD C calling convention does not
support preservation of these special registers across a procedure call.

Of the 15 special registers accessible only in Supervisor mode (sr0-sr14), three
registers are allocated to controlling access to external data memory (the channel
registers).  Because  at  the  time  of  a  synchronous  context  switch  there  is  no 
outstanding  data  memory  access  activity,  these  registers  also  need  not  be  saved.  This 
is  only  true  if  an  instruction  causing  a  trap  is  used  to  issue  the  system  call  and  there  is 
no  outstanding  data  memory  access  DERR  pending.  The  Am29000  processor 
serializes  (completes)  all  channel  activity  before  trap  handler  code  commences.  For 
more  detail  on  the  DERR  pending  issue,  see  the  Optimization  section  which  follows. 

On  entering  the  system  call  procedure,  the  25  global  registers  used  by  the  calling 
procedure  no  longer  contain  essential  data.  This  means  that  they  need  not  be  saved. 
The  register  stack  support  registers  and  the  relevant  global  supervisor  registers  must 
be  saved. 

Additionally, four global registers (gr112-gr115) reserved for the user (not
affected by the compiler) must be saved if any application program uses them. If
these registers are not being used on a per-User basis, but shared between all Users
and the Supervisor code, then they need not be saved. For example, a real-time
system may choose to place peripheral status information in these registers for users to
examine. The status information may be updated by Supervisor mode interrupt
handlers.

The context information is stored in a per-task data region known as the Process
Control Block (PCB). The example task context save code below assumes the
register pointing to the PCB data region, pcb, has already been assigned the correct
memory address starting value.

An operating system register in the range ks1-ks12 is assumed to point to the
bottom of the PCB stack. Note that the CPS register bits set by the MTSRIM
instruction are system dependent; the RE bit may be required in some cases and the
IM field value is system dependent.


        .equ    SIG_SYNC, -1            ;indicate a synchronous save
        .equ    ENABLE, (SM|PD|PI)
        .equ    DISABLE, (ENABLE|DI|DA)
        .equ    FPStat0, gr93           ;floating-point
        .equ    FPStat1, gr94           ;trapware support
        .equ    FPStat2, gr95           ;registers

save:                                   ;example synchronous context save
        constn  it0, SIG_SYNC
        push    pcb, it0
        push    pcb, gr1
        push    pcb, rab                ;push rab
        pushsr  pcb, it0, pc0           ;push specials
        pushsr  pcb, it0, pc1
        sub     pcb, pcb, 1*4           ;space pc2
        pushsr  pcb, it0, cha           ;push CHA
        pushsr  pcb, it0, chd           ;push CHD
        pushsr  pcb, it0, chc           ;push CHC
        sub     pcb, pcb, 1*4           ;space for alu
        pushsr  pcb, it0, ops           ;push OPS
        mtsrim  cps, DISABLE            ;remove freeze
        sub     pcb, pcb, 1*4           ;space for tav
        mtsrim  chc, 0                  ;possible DERR
        push    pcb, lr1                ;push R-stack
        push    pcb, rfb                ; support
        push    pcb, msp                ;push M-stack pnt.
        mtsrim  cps, ENABLE             ;enable interrupts
        push    pcb, FPStat0            ;floating point
        push    pcb, FPStat1
        push    pcb, FPStat2
        sub     pcb, pcb, 4*4           ;space for IPA..Q
        sub     pcb, pcb, 9*4           ;space gr116-124
        sub     pcb, pcb, 4*4           ;push gr112-115, optional
        mtsrim  cr, 4-1
        storem  0, 0, gr112, pcb
        sub     pcb, pcb, 16*4          ;space for gr96-111

Local registers currently in use, those that lie in the region pointed to by gr1 and
rfb (gr127), require saving. Not all of the local register cache needs saving. The
example code below assumes the user was running with address translation on. Thus,
to gain access to the user’s register stack, the Supervisor must use the UA option bit
when storing out the cache contents. If the user had been running in physical address
mode, then there is no need for the Supervisor to use the UA option to temporarily
obtain User mode access permissions.

The context save code example above operates with physical addresses in
Supervisor mode. This means address translation is not enabled. To enable data
address translation when the UA bit is used, the PD bit in the CPS register must be
cleared. Some operating system developers may choose to run the Supervisor mode
code with address translation turned on; in such cases, the PD bit will already be
cleared. Remember, once the PD bit is reset, it is possible to take a TLB miss. With the
UA bit set during the cache store operation, the TLB miss will relate to the temporary
User mode data memory access.


        .equ    UA, 0x08                ;UA access
        .equ    PD, 0x40                ;PD bit

        mtsrim  cps, ENABLE&~PD         ;virtual data
        sub     kt0, rfb, gr1           ;get bytes in cache
        srl     kt0, kt0, 2             ;adjust to words
        sub     kt0, kt0, 1
        mtsr    cr, kt0
        storem  0, UA, lr0, gr1         ;save lr0-rfb
        mtsrim  cps, ENABLE             ;return to physical
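The value written to the CR register in the fragment above is one less than the number of words lying between gr1 and rfb. As a rough illustration only, the same arithmetic can be modelled in C. The function name and the use of plain integers for the register contents are assumptions made for this sketch; they are not part of the 29K code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the count loaded into CR before the STOREM that
 * saves the local registers from lr0 up to rfb.
 * gr1 and rfb are byte addresses on the register stack. */
static uint32_t storem_count(uint32_t gr1, uint32_t rfb)
{
    assert(rfb > gr1);            /* at least one word must be in use  */
    uint32_t bytes = rfb - gr1;   /* sub kt0, rfb, gr1                 */
    uint32_t words = bytes >> 2;  /* srl kt0, kt0, 2                   */
    return words - 1;             /* sub kt0, kt0, 1                   */
}
```

For a twelve-word region, for instance, the sketch returns 11, which is the form of count the listing loads before its store-multiple.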

5.2.1  Optimizations 

When  an  ASSERT  instruction  is  used  to  enter  Supervisor  mode,  all  outstanding 
data  memory  access  activity  is  completed  before  the  trap  handler  gains  control.  If  no 
data  access  error  (DERR)  occurs  then  the  channel  registers  will  contain  no  valid  data 
and  need  not  be  saved.  However,  when  the  channel  access  is  serialized  and  forced  to 
complete,  a  priority  four  DERR  may  have  occurred.  The  DERR  trap  competes  with 
the  priority  three  system  call  trap  (higher  than  four),  and  thus  the  system  call  trap 
handler  commences  but  with  the  channel  still  containing  information  pertaining  to 
the  failed  data  access. 

A performance gain can be obtained by not saving the channel registers to
external data memory. If the memory system hardware is unable to generate the
DERR signal, then the channel registers need not be saved. Additionally, if the
software developer knows the previous data memory access has been completed or
was to a known memory location, there may be no need to save the channel registers.
The code shown below is an alternative to the previous system call trap handler entry
code; the transaction-fault bit (TF) in the channel control register (CHC) is tested to
determine if channel registers need saving.

A further performance gain can be obtained by not saving the PC0 register.
When the PC1 register is restored, the PC0 register can be determined by adding 4 to
the PC1 address value. To achieve the best performance gains, the code in the
subsequent Restoring Context section may be optimized to avoid restoring channel
registers CHA and CHD if the CHC contents-valid (CV) bit is zero.


save_channel:                           ;deal with
        pushsr  pcb, it0, cha           ; DERR fault
        pushsr  pcb, it0, chd
        pushsr  pcb, it0, chc
        jmp     channel_saved
        mtsrim  chc, 0                  ; clear TF

sync_save:                              ;example synchronous
        constn  it0, SIG_SYNC
        push    pcb, it0
        push    pcb, gr1
        push    pcb, rab                ;push rab
        sub     pcb, pcb, 1*4           ;space for pc0
        pushsr  pcb, it0, pc1           ;push pc1
        sub     pcb, pcb, 1*4           ;space for pc2
        mfsr    it0, chc                ;test TF bit
        sll     it0, it0, 31-10         ; in CHC set
        jmpt    it0, save_channel
        sub     pcb, pcb, 2*4           ;space for cha, chd
        const   it0, 0
        push    pcb, it0                ;push CHC=0
channel_saved:
        sub     pcb, pcb, 1*4           ;space for alu
        pushsr  pcb, it0, ops           ;push OPS
        mtsrim  cps, DISABLE            ;remove freeze
        sub     pcb, pcb, 1*4           ;space for tav
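The two tricks used by this entry code, testing a single CHC bit with a shift and rebuilding PC0 from PC1, can be sketched in C. This is a model under stated assumptions: the bit position 10 is taken from the shift count 31-10 in the listing, and the function names are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CHC_TF_BIT 10u   /* assumed from "sll it0, it0, 31-10" in the listing */

/* Models the shift-and-JMPT test: the shift moves the TF bit into bit 31,
 * and JMPT branches when the most significant bit of its operand is set. */
static int chc_fault_pending(uint32_t chc)
{
    uint32_t shifted = chc << (31u - CHC_TF_BIT);
    return (shifted >> 31) != 0;
}

/* PC0 need not be stored: on restore it is rebuilt as PC1 plus one
 * instruction (4 bytes), as the text describes. */
static uint32_t pc0_from_pc1(uint32_t pc1)
{
    assert((pc1 & 3u) == 0);   /* instruction addresses are word aligned */
    return pc1 + 4u;
}
```

Only when chc_fault_pending() would report a fault does the handler need to take the slower save_channel path.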

When restoring the task currently being saved, it is not necessary to reload all
128 local registers, or even the part of the register file in use at context save time
([gr1]-[rfb]). Only the activation record of the last executing procedure for the task
([gr1]-[lr1]) need be restored (see Figure 5-3). This greatly reduces the time required to restore a
task context originally saved by a synchronous context switch. Typically the size of a
procedure activation record ([gr1]-[lr1]) is twelve words. To achieve this
optimization, the push of rab and rfb, shown in the previous code fragment, must be
changed to the code shown below. This ensures only one activation record is restored.

Figure 5-3. Overlapping Activation Records Eventual Spill Out of the Register Stack Cache
(Figure labels: higher addresses at the top; memory resident and cache resident regions, separated at rfb; local registers lr0-lr6 holding out-going parameters, locals, return address, frame pointer and in-coming parameters; an activation record only partially in the cache; the top of each activation record; the activation record of the last executing procedure; gr1 at the top of stack.)


        .equ    WS, 512                 ;Window Size

        const   it0, WS                 ;replacement for
        sub     rab, lr1, it0           ; push rab
        push    pcb, rab                ;push lr1-512
        push    pcb, lr1                ;replacement for push rfb
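The effect of this replacement is that the saved frame describes a register stack whose cache-resident part ends at lr1 rather than at rfb, so only [gr1]-[lr1] is reloaded when the task is restored. A minimal C model of the two quantities involved is given below; the function names are hypothetical, and register contents are represented as plain byte addresses.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW_SIZE 512u   /* WS in the listing: local register file size in bytes */

/* Words in the activation record of the last executing procedure,
 * that is, the region [gr1]-[lr1]. */
static uint32_t activation_record_words(uint32_t gr1, uint32_t lr1)
{
    assert(lr1 >= gr1);
    return (lr1 - gr1) >> 2;
}

/* Value stored in place of rab: one window below the frame pointer,
 * mirroring "sub rab, lr1, it0" with it0 = WS. */
static uint32_t rab_for_restore(uint32_t lr1)
{
    assert(lr1 >= WINDOW_SIZE);
    return lr1 - WINDOW_SIZE;
}
```

With gr1 at 0x7000 and lr1 holding 0x7030, the record is twelve words long, the typical size quoted in the text.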
Burst mode enables data to be loaded from or stored to memory consecutively,
without the processor continuously supplying address information. An external
address latch/counter is required to support such memory systems with an Am29000
or Am29050 processor. A system designer can use this feature to reduce context
switch times.

Figure 5-4. Context Save PCB Layout
(The stack is shown with higher addresses at the top of the figure, and lower
addresses at the bottom; the stack grows down. The figure also marks with an
asterisk the registers not normally saved, and shows interrupts being enabled
once msp has been stored.)

    Interrupt Frame             System Call Frame
    (Async. Save Frame)         (Sync. Save Frame)
    ------------------------    ------------------------
    signal number               signal number
    gr1                         gr1
    rab                         rab
    PC0                         PC0
    PC1                         PC1
    PC2                         PC2
    CHA                         CHA
    CHD                         CHD
    CHC                         CHC=0
    ALU                         ALU
    OPS                         OPS
    tav                         tav
    lr1                         lr1
    rfb                         rfb
    msp                         msp
    floating point support      floating point support
    IPA                         IPA
    IPB                         IPB
    IPC                         IPC
    Q                           Q
    gr116-gr124                 gr116-gr124
    gr112-gr115                 gr112-gr115
    gr96-gr111                  gr96-gr111
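One way to picture the frame of Figure 5-4 is as a C structure. The sketch below is an illustration under assumptions, not a definition taken from the text: the structure and field names are invented here, and the field order is inferred from the save listings, where each push moves pcb down by one word so that the first item pushed ends up at the highest address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the PCB frame, listed from the lowest address
 * (last word stored) to the highest address (first word pushed). */
struct pcb_frame {
    uint32_t gr96_gr111[16];   /* return value and temporary registers     */
    uint32_t gr112_gr115[4];   /* registers reserved for the user          */
    uint32_t gr116_gr124[9];   /* compiler temporaries                     */
    uint32_t q;
    uint32_t ipc;
    uint32_t ipb;
    uint32_t ipa;
    uint32_t fpstat2;          /* floating-point trapware support          */
    uint32_t fpstat1;
    uint32_t fpstat0;
    uint32_t msp;              /* memory stack pointer                     */
    uint32_t rfb;              /* register stack support                   */
    uint32_t lr1;
    uint32_t tav;
    uint32_t ops;
    uint32_t alu;
    uint32_t chc;              /* channel registers                        */
    uint32_t chd;
    uint32_t cha;
    uint32_t pc2;
    uint32_t pc1;
    uint32_t pc0;
    uint32_t rab;
    uint32_t gr1;
    uint32_t signal_number;    /* SIG_SYNC (-1) for a synchronous save     */
};

/* Both save listings move pcb down by 51 words in total. */
static_assert(sizeof(struct pcb_frame) == 51 * sizeof(uint32_t),
              "frame is 51 words");
```

The synchronous save leaves several of these slots unwritten (it only reserves the space), which is why both frame types can share one layout.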

5.3  ASYNCHRONOUS  CONTEXT  SWITCH 

An  asynchronous  context  switch  occurs  when  the  current  task  unexpectedly 
gives  up  the  processor  to  enable  another  task  to  execute.  This  may  occur  when  a  timer 
interrupt  results  in  the  supervisor  deciding  the  current  task  is  no  longer  the  task  of 
highest  priority.  Unlike  at  the  point  at  which  a  synchronous  context  switch  occurs, 
when  an  interrupt  occurs  the  state  of  the  processor  is  not  restricted  to  a  simple  state. 

Because an interrupt may occur in a leaf procedure, it is not possible to
determine if the current lr1 value contains a valid pointer to the top of the procedure
activation record. Further, the interrupt may have occurred during a procedure
prologue, where the register stack pointer (gr1) has been lowered but the lr1 value
has not yet been updated. This means when an asynchronous-saved task is switched
back in, it is impossible to restore only the activation record of the interrupted
procedure. The register stack containing valid data, that is [gr1]-[rfb], must be
restored. Assuming this amounts to half of the register file, an additional 2.6
microseconds would be required to restore the task with a single-cycle Am29000
processor memory system at 25MHz.
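The 2.6 microsecond figure appears consistent with moving 64 words (half of the 128 local registers) at one word per 40 ns cycle, that is 64 x 40 ns = 2.56 microseconds. The small C helper below reproduces that arithmetic; it is an estimate only, the function name is hypothetical, and it ignores any fixed start-up cost of the transfer.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated time, in nanoseconds, to move `words` 32-bit words through a
 * memory system that needs `cycles_per_word` cycles per word at a
 * processor clock of `clock_mhz` MHz. */
static uint32_t transfer_time_ns(uint32_t words,
                                 uint32_t cycles_per_word,
                                 uint32_t clock_mhz)
{
    assert(clock_mhz > 0);
    /* one cycle lasts 1000 / clock_mhz nanoseconds */
    return (words * cycles_per_word * 1000u) / clock_mhz;
}
```

Under the same assumptions a twelve-word activation record costs roughly 480 ns, which is the gap the synchronous-switch optimization exploits.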

A task voluntarily giving up the processor via a system call from within a
procedure of typical activation record size can be restored faster than a task giving up
the processor involuntarily via an asynchronous interrupt.

When a User mode program is interrupted it could mean the current process is to
be sent a signal, such as a segmentation violation. It could also mean that the
Supervisor wishes to gain control of the processor to support servicing the
interrupting device. If the current process is being signaled, the Supervisor mode
interrupt handler should jump to the label user_signal (see the example code
below). This is explained in the later section titled User Mode Signals (section 5.5). If
Supervisor support code is required for peripheral device servicing, then the action to
be taken is very much dependent on the needs of the interrupting device.

        .equ    SIGALRM, 14             ;alarm signal

time_out:                               ;timer interrupt handler
        jmp     interrupt_common
        const   it0, SIGALRM            ;signal number

interrupt_common:
        ;Depending on required processing,
        ;jump to user_signal for current process signaling.
        ;Or, jump to user_interrupt to save the current process context.

Some interrupts can be serviced in Freeze mode, without the need to save the
current process context. Use of these so-called lightweight interrupt handlers can
offer significant performance gains. Other interrupts will require the interrupted
process context to be saved. This is described in the following section, Interrupting
User Mode (section 5.4).

It is possible an interrupt has arrived that requires a signal to be sent to a process
which is not the currently executing process. In this case, the operating system must
first save the current process context and then restore the context of the signaled
process. Once the in-coming process is prepared to run, using the code in the Restore
Context section (section 5.10), the restored context will then have to be placed on the
signal stack as described in the User Mode Signals section (section 5.5). Thus, execution
would begin in the User mode trampoline code of the in-coming process. To follow
this in detail, later sections of the chapter will have to be studied.

5.4  INTERRUPTING  USER  MODE 

This  section  describes  how  the  operating  system  can  prepare  the  processor  to 
execute  a  C  level  interrupt  handler,  where  the  handler  is  to  run  in  Supervisor  mode 
and  the  interrupt  occurred  during  User  mode  code  execution. 

Because  the  User  mode  task  is  being  asynchronously  interrupted,  the  complete 
processor  state  must  be  saved.  The  context  information  should  be  stored  in  the  PCB 
rather  than  a  temporary  stack,  as  a  context  switch  to  a  new  user  task  may  occur  after 
the  interrupt  has  been  processed.  Storing  the  state  in  the  PCB  saves  having  to  copy 
the  state  from  the  temporary  stack  to  the  PCB  after  the  context  switch  decision  has 
been  made.  When  saving  task  context,  a  performance  optimization  is  obtained  by 
only  saving  the  registers  which  are  currently  in  use.  However,  such  optimizations 
typically  only  apply  to  synchronous-task  context  saving. 

When  User  mode  is  interrupted,  the  special  purpose  support  registers  may 
contain  valid  data.  This  means  an  additional  nine  special  register  data  values  must  be 
copied  to  external  data  memory,  compared  to  the  synchronous  context  switch. 

Below  is  a  code  example  of  interrupt  context  saving.  Notice  the  rab  stack 
support  register  is  adjusted  to  a  window  distance  below  rfb  within  the  interrupt 
disabled  portion  of  the  code.  This  is  to  conform  to  the  same  PCB  format  used  by  those 
who  wish  to  perform  the  register  stack  fix-up  with  User  mode  code,  rather  than  in  the 
Supervisor code shown. Register rab is merely a convenience value for determining
Window Size (WindowSize is normally 512) in detecting a SPILL condition.
However, it is also used to determine FILL or SPILL interruption. Should the User
mode stack fix-up code be interrupted during its operation, it is important that it does
not become confused with the original SPILL or FILL interrupt. Realigning the rab
register whilst interrupts are off prevents this confusion.

        .equ    WS, 512                 ;Window Size

user_interrupt:                         ;saving User mode context
        push    pcb, it0                ;stack signal id
        push    pcb, gr1
        push    pcb, rab                ;stack real rab
        const   it0, WS
        sub     rab, rfb, it0           ;set rab=rfb-512
        pushsr  pcb, it0, pc0           ;push specials
        pushsr  pcb, it0, pc1
        pushsr  pcb, it0, pc2
        pushsr  pcb, it0, cha
        pushsr  pcb, it0, chd
        pushsr  pcb, it0, chc
        pushsr  pcb, it0, alu
        pushsr  pcb, it0, ops
        mtsrim  cps, DISABLE            ;remove freeze
        push    pcb, tav
        mtsrim  chc, 0                  ;clear CHC
        push    pcb, lr1                ;push R-stack
        push    pcb, rfb                ; support
        push    pcb, msp                ;M-stack pnt.
        mtsrim  cps, ENABLE             ;enable interrupts
        push    pcb, FPStat0            ;floating point
        push    pcb, FPStat1
        push    pcb, FPStat2
        pushsr  pcb, kt0, ipa           ;more specials
        pushsr  pcb, kt0, ipb
        pushsr  pcb, kt0, ipc
        pushsr  pcb, kt0, q

The  25  global  registers,  known  to  contain  no  valid  data  during  a  synchronous 
context  switch,  must  also  be  considered  active,  and  consequently  saved.  Because 
these  global  registers  are  located  adjacent  to  the  four  global  registers  reserved  for  the 
user,  a  single  store-multiple  instruction  can  be  used  to  save  the  relevant  global 
registers. Considering a single-cycle memory system, two microseconds should be
required to save the additional current task context.


        sub     pcb, pcb, 29*4          ;push gr96-gr124
        mtsrim  cr, 29-1                ;including optional save of
        storem  0, 0, gr96, pcb         ; gr112-gr115

If the interrupt is expected to result in a context switch then the local registers
currently in use require saving. Note, this can be postponed (see the following
optimizations section). Not all of the local register cache needs to be saved. However,
as is explained below, do not simply assume that those that lie in the region pointed to
by gr1 and rfb (gr127) are the only active cache registers.

When a synchronous context switch occurs the register stack is known to be in a
valid condition (see Figure 5-1). With an asynchronous event causing a context
switch, the stack may not be in a valid condition. There are three inconsistent
situations that must be detected and dealt with.

■  The interrupt occurred when a SPILL was in progress, in which case the
distance between the values in the rfb and rab registers exceeds the size of the
local register file (referred to as the Window Size). All of the local register file
must be saved. Some of the cached data may have already been copied out to
memory locations just below rfb. This data should remain at this location on the
memory resident portion of the stack until the task is restarted.

■  The interrupt occurred when a FILL operation was in progress, in which case
the distance between the values in the rfb and rab registers is less than the size of
the local register file. Some data may have been copied in from the top of the
memory resident portion of the register stack into local registers just above rab.
These registers will not be saved during the normal cache save ([gr1]-[rfb]). To
deal with this the FILL must be restarted when the context is restored.

■  The interrupt occurred during a far-longjmp operation. A far-longjmp is
defined as one in which the future (gr1 + 8) value, which is the address of local
register lr2 on the register memory stack, is greater than the current value in
the rfb register. In this case the local registers contain no valuable data because a
previous activation record (present during setjmp) is about to be restored from
the memory resident portion of the stack.


        .equ    WS, 512                 ;Window Size

R_fixup:                                ;register stack fix-up
        add     kt0, pcb, CTX_RAB
        load    0, 0, kt2, kt0          ;get rab value
        sub     kt0, rfb, kt2           ; window size
        srl     kt0, kt0, 2             ;convert to words
        cpeq    kt1, kt0, WS>>2         ;test for valid
        jmpt    kt1, norm               ; stack condition
        cpltu   kt1, kt0, WS>>2         ;test for FILL
        jmpt    kt1, ifill              ; interrupt
        add     kt1, gr1, 8             ;test far-longjmp
        cpgtu   kt1, kt1, rfb           ; interrupt
        jmpt    kt1, illjmp             ;yes, gr1+8 > rfb
        nop
;
ispill:                                 ;deal with interrupted SPILL
        const   kt1, WS
        jmp     norm
        sub     gr1, rfb, kt1           ;gr1=rfb-512
;
ifill:                                  ;deal with interrupted FILL
        add     kt1, pcb, CTX_CHC
        const   kt0, 0
        push    kt1, kt0                ;resave CHC=0
        add     kt0, FillAddrReg, 4
        add     kt1, pcb, CTX_PC0
        push    kt1, kt0                ;resave PC0,PC1
        push    kt1, FillAddrReg
        add     kt1, pcb, CTX_RAB
        push    kt1, rab                ;resave rab=rfb-512
;
norm:                                   ;deal with consistent stack
        sub     kt1, rfb, gr1           ;bytes in cache
        srl     kt1, kt1, 2             ;convert to words
        sub     kt1, kt1, 1             ;adjust for storem
        mtsr    cr, kt1
        mtsrim  cps, ENABLE&~PD         ;virtual data
        storem  0, UA, lr0, gr1         ;copy to stack
        mtsrim  cps, ENABLE             ;physical data
illjmp:                                 ;valid local registers now saved
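The sequence of tests made at the start of R_fixup can be summarized in C. The sketch below is a model of the decision logic only, with invented type and function names; register contents are passed in as byte addresses, and saved_rab stands for the rab value that was stored in the PCB when the interrupt was taken.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW_SIZE 512u   /* WS in the listing: local register file size in bytes */

enum rstack_state {
    RSTACK_NORMAL,        /* consistent stack                      */
    RSTACK_FILL,          /* interrupted during a FILL             */
    RSTACK_FAR_LONGJMP,   /* interrupted during a far-longjmp      */
    RSTACK_SPILL          /* interrupted during a SPILL            */
};

/* Classify the register stack in the same order as the assembly tests. */
static enum rstack_state classify_rstack(uint32_t gr1,
                                         uint32_t saved_rab,
                                         uint32_t rfb)
{
    assert(rfb >= saved_rab);
    uint32_t window_words = (rfb - saved_rab) >> 2;    /* sub, srl        */

    if (window_words == (WINDOW_SIZE >> 2))            /* cpeq  -> norm   */
        return RSTACK_NORMAL;
    if (window_words < (WINDOW_SIZE >> 2))             /* cpltu -> ifill  */
        return RSTACK_FILL;
    if (gr1 + 8u > rfb)                                /* cpgtu -> illjmp */
        return RSTACK_FAR_LONGJMP;
    return RSTACK_SPILL;                               /* falls to ispill */
}
```

Note that the far-longjmp test is reached only once the window is known to be larger than the register file, matching the order in which the listing falls through its branches.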

Once  the  user’s  User  mode  register  stack  has  been  saved,  the  interrupt  handler 
continues  using  the  user’s  Supervisor  mode  register  and  memory  stacks. 


        .macro  const32, reg, data
        const   reg, data               ;zero high, set low
        consth  reg, data               ;high 16-bits
        .endm

        const32 msp, SM_STACK           ;Supervisor M-stack
        const32 rab, SR_STACK-WS
        add     gr1, rfb, 8             ;prepare Supervisor
        const32 rfb, SR_STACK           ; R-stack support
        add     lr1, rfb, 0             ; registers
;
        ;call appropriate C-level interrupt handler
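The const32 macro builds a full 32-bit constant from two 16-bit halves, following the comments in the macro body: the first instruction zeroes the high half and sets the low half, the second fills in the high half. A small C model of that behavior is shown below; the function names are hypothetical and the model covers only the data movement, not instruction encoding.

```c
#include <assert.h>
#include <stdint.h>

/* CONST: load a 16-bit constant into the low half and zero the high half. */
static uint32_t op_const(uint16_t low16)
{
    return (uint32_t)low16;
}

/* CONSTH: replace the high half, leaving the low half untouched. */
static uint32_t op_consth(uint32_t reg, uint16_t high16)
{
    return (reg & 0x0000FFFFu) | ((uint32_t)high16 << 16);
}

/* Model of the const32 macro: CONST followed by CONSTH on the same register. */
static uint32_t const32(uint32_t data)
{
    uint32_t reg = op_const((uint16_t)(data & 0xFFFFu));
    reg = op_consth(reg, (uint16_t)(data >> 16));
    assert(reg == data);   /* the two halves reassemble the full constant */
    return reg;
}
```

The order matters: because the first step clears the high half, reversing the two instructions would lose the upper 16 bits.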

The current task context has now been saved. After the interrupt has been
processed the operating system can select a different task to restore. This operation is
described in a subsequent section entitled Restoring Context (section 5.10). The PCB
structure for the out-going task shall not be accessed until the task is again restored as
the current executing task.

5.4.1  Optimizations 

When  User  mode  is  interrupted,  processing  continues  using  the  user’s 
Supervisor  mode  stacks.  This  is  necessary  because  the  interrupt  may  result  in  the 
process  being  put  to  sleep  until  some  time  later  when  it  is  again  able  to  run.  When  the 
process  is  put  to  sleep,  the  process  state  is  stored  in  the  Supervisor  memory  stack, 
described  in  the  Interrupting  Supervisor  Mode  section  (section  5.6).  If  the  user’s 
User  mode  context  was  saved  on  a  shared  interrupt  stack  rather  than  the  per-process 
Supervisor  stack,  then  the  context  would  have  to  be  copied  from  the  global  interrupt 
stack  to  the  Supervisor  stack  before  a  context  switch  could  proceed. 
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The  code  shown  above  determines  the  region  of  cache  registers  currently  in  use 
and  stores  them  out  onto  the  top  of  the  user’s  User  mode  register  stack.  This  operation 
can  be  postponed.  The  interrupt  handler  will  use  the  register  cache  in  conjunction 
with  the  Supervisor  mode  register  stack.  If  the  interrupt  handler  runs  to  completion 
and  no  context  switch  occurs,  then  the  cache  need  not  be  saved.  If  a  context  switch 
does  occur  then  the  cache  will  be  saved  on  the  top  of  the  user’s  Supervisor  mode 
register stack. This means some User mode data contained in the cache may be
temporarily saved on the Supervisor stack; however, this is not a problem.

The previous code determines the region of the cache currently in use; it does
not bring the stack into a valid condition. The code following the label R_fixup: in
the  User  Mode  Signals  section  (section  5.5)  does  bring  the  stack  into  a  valid 
condition,  and  can  be  used  to  replace  the  code  shown  above.  Once  the  stack  support 
registers  are  restored  to  a  valid  state,  the  stack-cut-across  method  described  in  the 
later  User  System  Calls  section  (section  5.7)  can  be  used  to  attach  the  cache  to  the 
Supervisor  mode  stack.  By  this  method  the  storing  of  cache  data  can  be  prevented  and 
any  unused  portion  of  the  cache  is  made  immediately  available  to  the  interrupt 
service  routine. 

5.5  PROCESSING  SIGNALS  IN  USER  MODE 

Asynchronous  context  switches  often  occur  because  an  interrupt  has  occurred 
and  must  be  processed  by  a  handler  function  developed  in  C.  A  technique  often 
overlooked  in  real-time  applications  is  using  a  signal  handler  to  process  the  interrupt. 
This  often  avoids  much  of  the  supervisor  overheads  associated  with  a  context  switch. 
Additionally, a context switch requires the instruction cache to be flushed. Signal
handlers run in the context of the interrupted User mode process, which avoids the need
to flush the cache.

It  is  not  necessary  to  store  the  contents  of  the  local  register  file.  After  signal 
support  code  has  fixed-up  the  stack  management  support  registers,  the  C  level 
handler  code  can  continue  to  use  the  register  stack  as  if  the  interrupted  procedure  had 
executed a call to the handler function. In as little as 5.5 microseconds from the time
of receiving the interrupt, the Am29000 can be executing the interrupt handler code
which was written in C.

Unlike asynchronous context switching, the interrupted context cannot be
saved in the PCB. Doing so would be convenient if a context switch were possible after
the signal handler had finished executing, because the PCB structure would already be
updated. However, a further interrupt may occur during the C level signal handler
execution,  which  may  itself  result  in  an  immediate  context  switch  and  require  the  use 
of  the  PCB  data  save  area.  Additionally,  the  signal  handler  may  do  a  longjmp  to  a 
setjmp  which  occurred  in  User  mode  code  before  the  signal  handler  started 
executing. For this reason the context information is placed on the user’s memory
stack pointed to by msp.

Users of operating systems complying with the AMD HIF specification are
required  to  complete  signal  handler  preparation  tasks  in  User  mode  code  supplied  in 
AMD  libraries.  HIF  compliant  operating  systems  only  save  the  signal-number 
through  the  tav  register  portion  of  the  interrupt  frame  on  the  user’s  memory  stack. 
The  remaining  part  of  the  interrupt  frame  is  saved  by  the  user’s  code.  Any  necessary 
register  stack  management  is  performed.  The  User  mode  code  is  shown  in  Appendix 
B and described in detail in section 4.4. The following code is for operating systems
which save the complete interrupt frame and prepare for a User mode signal while in
Supervisor mode.


        .equ    SIGILL, 4               ;illegal operation
        .equ    WS, 512                 ;Window Size

protect:                                ;Protection violation trap handler
        jmp     user_signal             ;send interrupted task a signal
        const   it0, SIGILL             ;signal number


If  the  interrupted  User  mode  code  was  running  with  address  translation  turned 
on,  then  the  user’s  memory  stack  must  be  accessed  by  the  Supervisor  using  the  UA  bit 
during  LOAD  and  STORE  instructions  (note,  this  is  also  true  for  the  push  and pushsr 
macros).  The  following  code  example  shows  pushing  onto  a  physically  accessible 
user  memory  stack.  If  the  user’s  stack  were  virtually  addressed,  then  the  push 
instructions  would  be  replaced  by  move  to  temporary  register  instructions.  After 
interrupts  were  enabled  the  PD  bit  in  the  CPS  register  would  be  cleared  to  enable  data 
address  translation,  and  then  the  temporary  registers  would  be  pushed  onto  the  user’s 
memory  stack  using  the  UA  bit  during  the  STORE  instruction  operation.  Once  the 
frozen  special  registers  had  been  saved,  via  the  use  of  temporary  registers,  the 
Supervisor  could  continue  to  run  with  the  CPS  register  bits  PD  and  DA  cleared,  and 
store  the  remaining  user  state  via  push  operations. 


user_signal:                            ;prepare to process a signal
        push    msp, it0                ;stack signal id
        push    msp, gr1
        const   it0, WS
        sub     rab, rfb, it0           ;set rab=rfb-512
        pushsr  msp, it0, pc0           ;push specials
        pushsr  msp, it0, pc1
        pushsr  msp, it0, pc2
        pushsr  msp, it0, cha
        pushsr  msp, it0, chd
        pushsr  msp, it0, chc
        pushsr  msp, it0, alu
        pushsr  msp, it0, ops
        mtsrim  cps, DISABLE            ;remove freeze
        push    msp, tav
        mtsrim  chc, 0                  ;clear CHC
;
        push    msp, lr1                ;push R-stack
        push    msp, rfb                ; support
        push    msp, msp                ;M-stack support
;
        mtsrim  cps, ENABLE             ;enable interrupts
;
        push    msp, FPStat0            ;floating point
        push    msp, FPStat1
        push    msp, FPStat2
;
        pushsr  msp, kt0, ipa           ;more specials
        pushsr  msp, kt0, ipb
        pushsr  msp, kt0, ipc
        pushsr  msp, kt0, q
;
        sub     msp, msp, 29*4          ;push gr96-gr124
        mtsrim  cr, 29-1                ; including optional save of
        storem  0, 0, gr96, msp         ; gr112-gr115

The register stack must now be brought into a valid condition, if it is not already in
a valid condition. Valid is defined as consistent with the conditions supporting a
function call prologue. As described in the previous section 5.3, Asynchronous
Context Switching, the stack may not be valid if a SPILL, FILL or far-longjmp is
interrupted.

Unlike  the  asynchronous  context  save  case,  with  signal  processing  our  intention 
is  not  to  simply  determine  the  active  local  registers  for  saving  on  the  user’s  memory 
portion  of  the  register  stack,  but  to  enable  the  user  to  continue  making  function  calls 
with  the  existing  stack.  That  is,  the  C  language  signal  handler  will  appear  to  have 
been  called  in  the  normal  manner,  rather  than  as  a  result  of  an  interrupt. 


;Register stack fixup
R_fixup:
        const   kt0, WS                 ;WindowSize
        add     kt2, msp, CTX_RAB
        load    0, 0, kt2, kt2          ;interrupted rab
        sub     kt1, rfb, kt2           ;determine if
        cpgeu   kt1, kt1, kt0           ;rfb-rab>=WindowSize
        jmpt    kt1, nfill              ;jmp if spill
                                        ;or valid stack
        add     kt1, gr1, 8             ;check if
        cpgtu   kt1, kt1, rfb           ; gr1+8 > rfb
        jmpt    kt1, nfill              ;yes, long-longjmp
        nop
;
ifill:                                  ;here for interrupted FILL restart
        add     kt1, msp, CTX_CHC
        const   kt0, 0
        push    kt1, kt0                ;resave CHC=0
        add     kt0, FillAddrReg, 4
        add     kt1, msp, CTX_PC0
        push    kt1, kt0                ;resave PC0,PC1
        push    kt1, FillAddrReg
        add     kt1, msp, CTX_RAB
        push    kt1, rab                ;resave rab=rfb-512
;
nfill:                                  ;move gr1 into valid range
        cpgtu   kt0, gr1, rfb           ;if gr1 > rfb
        jmpt    kt0, lower              ;far-longjmp case
        cpltu   kt0, gr1, rab           ;if gr1 < rab then
        jmpf    kt0, sendsig            ;interrupted spill
        nop
raise:
        add     gr1, rab, 0
        jmp     sendsig
        nop
lower:
        add     gr1, rfb, 0
        jmp     sendsig
        nop
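
The decisions taken by the fixup code can be modelled in C. The following is a minimal sketch rather than part of the original listing: the RegStack structure and both function names are hypothetical, addresses are treated as plain 32-bit values, and saved_rab stands for the interrupted rab value read back from the saved context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WS 512u                 /* window size in bytes, as in the listing */

/* Hypothetical model of the register stack support registers. */
typedef struct {
    uint32_t gr1;               /* register stack pointer */
    uint32_t rab;               /* register allocate bound */
    uint32_t rfb;               /* register free bound */
} RegStack;

/* True when both tests ahead of label ifill fall through: the interrupted
 * rab is less than a window below rfb, and gr1+8 does not exceed rfb. */
static bool interrupted_fill(const RegStack *s, uint32_t saved_rab)
{
    if (s->rfb - saved_rab >= WS)
        return false;           /* interrupted spill, or valid stack */
    if (s->gr1 + 8u > s->rfb)
        return false;           /* far-longjmp case */
    return true;
}

/* Mirrors labels nfill, raise and lower: move gr1 into [rab, rfb].
 * Assumes rab already holds rfb-WS, as set by the entry code. */
static void clamp_gr1(RegStack *s)
{
    if (s->gr1 > s->rfb)
        s->gr1 = s->rfb;        /* lower: far-longjmp case */
    else if (s->gr1 < s->rab)
        s->gr1 = s->rab;        /* raise: interrupted spill */
}
```

In this model a gr1 value already inside the window is left untouched, which corresponds to the direct jump to sendsig.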

Now use the signal number to determine the address of the corresponding signal
handler. The code below assumes there is an array of signal handlers. The first entry
of the array is held at memory address SigArray.

sendsig:                                ;prepare to leave Supervisor mode
        add     kt0, msp, CTX_SIGNUMB
        load    0, 0, gr96, kt0         ;get signal numb.
        sub     kt2, gr96, 1            ;handler index...
        sll     kt2, kt2, 2             ;...in words
        const   kt1, SigArray
        consth  kt1, SigArray
        add     kt2, kt2, kt1
        load    0, 0, gr97, kt2         ;handler adds.
;
        mtsrim  cps, FREEZE             ;enter Freeze mode
        const   kt1, _trampoline
        add     kt0, kt1, 4
        mtsr    pc1, kt1                ;return to user
        mtsr    pc0, kt0                ;and process signal
        iret
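
The handler lookup performed at sendsig amounts to indexing a table of function pointers with the signal number less one. The sketch below illustrates this in C under stated assumptions: sig_array stands in for SigArray, and the table size, the handler type, the bounds check and demo_handler are hypothetical additions that do not appear in the assembly code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*sig_handler_t)(int signum, void *ctx);

#define NSIG_MODEL 8                            /* hypothetical table size */
static sig_handler_t sig_array[NSIG_MODEL];     /* stands in for SigArray */

/* Byte offset formed by "sub kt2,gr96,1" followed by "sll kt2,kt2,2";
 * each table entry is one 32-bit word on the 29K. */
static uint32_t handler_offset(uint32_t signum)
{
    return (signum - 1u) << 2;
}

/* Signal 1 uses the first entry of the table. */
static sig_handler_t lookup_handler(int signum)
{
    if (signum < 1 || signum > NSIG_MODEL)
        return NULL;            /* bounds check added for the sketch only */
    return sig_array[signum - 1];
}

/* Hypothetical handler used to exercise the table. */
static int last_signal;
static void demo_handler(int signum, void *ctx)
{
    (void)ctx;
    last_signal = signum;
}
```

Installing demo_handler at sig_array[3] makes it the handler for signal number 4 (SIGILL in the earlier listing), reached at byte offset 12 from the start of the table.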

Via  an  IRET,  execution  is  continued  in  User  mode  procedure  trampoline.  This 
procedure  is  often  located  in  the  memory  page  containing  the  PCB  structure.  Using 
User  accessible  global  registers  gr96  and  gr97,  two  parameters,  the  signal  number 
and  a  pointer  to  the  signal  handler  routine,  are  passed  to  the  trampoline  code.  The 
handler  routine  is  called,  passing  to  it  the  signal  number  and  a  pointer  to  the  saved 
context. 


;User mode entry to signal handler
_trampoline:
        sub     gr1, gr1, 6*4           ;Dummy Call
        asgeu   V_SPILL, gr1, rab       ;space for C-call
        add     lr1, gr1, 6*4
        add     lr0, gr97, 0            ;copy handler()
        add     lr2, gr96, 0            ;copy signal #
        add     lr3, msp, 0             ;pass CTX pointer
        calli   lr0, lr0                ;call handler()
        nop
;
        add     gr1, gr1, 6*4           ;restore stack
        nop
        asleu   V_FILL, lr1, rfb
        const   tav, SYS_SIGRETURN
        asneq   V_SYSCALL, gr1, gr1     ;system call


After the signal handler returns, the interrupted context is restored via the
sigreturn system call. The supervisor mode code used to implement the restoration
process is shown in the section titled Restoring Context (section 5.10). At the time of
the system call trap, the memory stack pointer, msp, must be pointing to the structure
containing the saved context. The system call code checks relevant register data to
ensure that the User is not trying to gain Supervisor access permissions as a result of
manipulating the context information during the signal handler execution. (Note, it is
likely that the assembly code library supporting the sigreturn system call will copy the
lr2 parameter value to the msp register before issuing the system call trap.)


5.6  INTERRUPTING  SUPERVISOR  MODE 

A  user  program  may  be  in  the  process  of  executing  a  system  call  when  an 
interrupt  occurs.  This  interrupt  may  require  C  level  handler  processing.  In  some 
respects  this  is  similar  to  a  user  program  dealing  with  a  C  level  signal  handler; 
however, there are some important differences. A User mode signal handler may
choose not to run to completion by doing a longjmp out of the signal handler. Also,
signal  handlers  process  User  mode  data.  Supervisor  mode  interrupt  handlers  always 
run  to  completion  and  process  data  relevant  to  the  Supervisor’s  support  task  rather 
than  the  current  User  mode  task. 

Because  a  user  task  is  being  interrupted  whilst  operating  in  Supervisor  mode, 
the  complete  processor  state  must  be  saved  in  a  similar  way  to  an  asynchronous 
context switch. The context information cannot be stored in the current user’s PCB
because  it  is  used  to  hold  the  User  mode  status  when  Supervisor  mode  is  entered  via  a 
system  call. 

User  programs  usually  switch  stacks  when  executing  system  calls  (see  section 
5.7).  The  user’s  system  stack  is  not  accessible  to  the  User  mode  program.  This  keeps 
Supervisor  information  that  appears  on  the  stack  during  system  call  execution  hidden 
from  the  user.  The  user’s  system  stack  can  be  used  to  support  C  function  calls  during 
interrupt  handler  processing.  Alternatively,  an  interrupt  processing  stack  can  be 
used.  Keeping  a  separate  interrupt  stack  for  Supervisor  mode  interrupt  processing 
enables a smaller system mode User stack to be supported, as the interrupt processing
does not cause the system stack to grow further. Remember, the per-user system
stack  is  already  in  use  because  the  user  was  processing  a  system  call  when  the 
interrupt  occurred. 

The  interrupt_common  entry  point  to  the  interrupt  handler  shown  in 
Asynchronous  Context  Switch  (section  5.3)  needs  to  be  expanded  to  distinguish 
between  interrupting  User  mode  and  interrupting  Supervisor  mode.  The  appropriate 
processing  requirement  is  determined  by  examining  the  OPS  register  in  the  interrupt 
handler.  The  label  user_interrupt  should  be  used  to  select  the  code  for  an  interrupt  of 
User  mode  code. 

interrupt_common:                       ;examine processor mode interrupted
        mfsr    it1, ops                ;get OPS special
        sll     it1, it1, 27            ;check SM bit
        jmpf    it1, user_interrupt     ;User mode inter.
        nop
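
The shift-and-test idiom used at interrupt_common can be expressed in C. This is only an illustrative sketch: the function name is hypothetical, and the SM bit position is inferred from the shift count in the listing (31 - 27 = 4) rather than taken from a register description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SM bit position implied by the shift count in the listing: 31 - 27 = 4. */
#define OPS_SM_BIT 4u

/* Shifting left by 27 moves the SM bit into bit 31, the bit tested by the
 * jmpf instruction. jmpf branches to user_interrupt when that bit is clear. */
static bool interrupted_supervisor_mode(uint32_t ops)
{
    uint32_t shifted = ops << 27;           /* sll it1,it1,27 */
    return (shifted & 0x80000000u) != 0;
}
```

A false result corresponds to the branch to user_interrupt; a true result falls through to the Supervisor mode interrupt code.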


The  following  code  assumes  Supervisor  mode  interrupts  are  not  nested,  because 
the  current  context  is  pushed  onto  the  interrupt  processing  stack  which  is  assumed 
empty.  If  interrupts  are  to  be  nested,  then  the  context  should  be  pushed  on  the  current 
memory  stack  once  it  has  been  determined  that  the  msp  has  already  been  assigned  to 
the  interrupt  memory  stack.  IM_STACK  and  IR_STACK  are  the  addresses  of  the 
bottom  of  the  interrupt  memory  and  register  stacks  respectively. 


        .equ    WS, 512                 ;Window Size

        .macro  const32, reg, data
        const   reg, data               ;zero high, set low
        consth  reg, data               ;high 16-bits
        .endm

supervisor_interrupt:                   ;process Supervisor mode interrupt
        const32 it1, IM_STACK           ;interrupt M-stack
        push    it1, it0                ;stack signal id
        push    it1, gr1
        const   it0, WS
        sub     rab, rfb, it0           ;set rab=rfb-512
        pushsr  it1, it0, pc0           ;push specials
        pushsr  it1, it0, pc1
        pushsr  it1, it0, pc2
        pushsr  it1, it0, cha
        pushsr  it1, it0, chd
        pushsr  it1, it0, chc
        pushsr  it1, it0, alu
        pushsr  it1, it0, ops
        mtsrim  cps, DISABLE            ;remove freeze
        push    it1, tav
        mtsrim  chc, 0                  ;clear CHC
        push    it1, lr1                ;push R-stack
        push    it1, rfb                ; support
        push    pcb, msp                ;push M-stack pntr.
        add     msp, it1, 0             ;use msp pointer
        mtsrim  cps, ENABLE             ;enable interrupts
        push    msp, FPStat0            ;floating point
        push    msp, FPStat1
        push    msp, FPStat2
        pushsr  msp, kt0, ipa           ;more specials
        pushsr  msp, kt0, ipb
        pushsr  msp, kt0, ipc
        pushsr  msp, kt0, q
        sub     msp, msp, 29*4          ;push gr96-gr124
        mtsrim  cr, 29-1                ; including optional
        storem  0, 0, gr96, msp         ; gr112-gr115

There  is  no  need  to  save  any  of  the  register  cache  data.  In  the  following  code,  the 
register  stack  support  registers  are  updated  with  the  initial  values  of  the  supervisor 
interrupt  stack.  If  nested  high  level  handler  interrupts  are  to  be  supported,  see  the 
following Optimizations section. The gr1 register stack pointer is then set to the top
(rab) of the cache, indicating the cache is fully in use. The new activation record size
pointer, lr1, is then set to the bottom of the cache (rfb). This ensures that when the
interrupted  C  level  service  function  returns,  the  cache  will  be  repaired  to  exactly  the 
position  at  which  the  interrupt  occurred.  This  is  particularly  important  if  a  Supervisor 
mode  FILL  was  interrupted.  The  user’s  system  mode  register  data  will  be  spilled 
onto  the  interrupt  stack,  but  this  creates  no  problem. 


        const32 rab, IR_STACK-WS
        add     gr1, rab, 0             ;prepare interrupt
        const32 rfb, IR_STACK           ; R-stack support
        add     lr1, rfb, 0             ; registers
;
;call appropriate C-level interrupt handler


5.6.1  Optimizations 

The code shown above does not attempt to determine the region of cache
registers currently in use. This means that the first C level procedure call in the
interrupt handler will result in a cache spill trap.

By  determining  the  region  of  the  cache  currently  in  use  and  by  bringing  the 
register  stack  into  a  valid  condition,  any  available  cache  registers  can  be  made 
immediately  available  to  the  interrupt  handler  C  routines.  The  code  following  the 
label R_fixup: in the previous User Mode Signals section (section 5.5) does bring the
stack into a valid condition and can be used to replace the code shown above. Once
the  stack  support  registers  are  restored  to  a  valid  state,  the  stack-cut-across  method 
described  in  the  User  System  Calls  section  (section  5.7)  can  attach  the  cache  to  the 
interrupt  register  stack. 

It  is  possible  that  while  processing  an  interrupt  (which  means  the  processor  is 
already  in  Supervisor  mode)  an  additional  interrupt  occurs.  If  an  operating  system 
supports  nested  interrupts,  then  the  code  in  the  Interrupting  Supervisor  Mode  section 
(section  5.6)  will  be  executed  again.  This  overhead  can  be  avoided  by  following  the 
Interrupt  Queuing  Model  method  described  in  section  4.3.12  of  the  Interrupts  and 
Traps  chapter. 

The  method  relies  on  supporting  only  lightweight  interrupt  nesting.  The  code  in 
this  section  is  entered  only  once  to  start  the  execution  of  a  C  level  interrupt  processing 
Dispatcher. Each interrupt adds an interrupt request descriptor (bead) onto a queue of
descriptors (string of beads). The dispatcher removes the requests and processes each
interrupt until the queue becomes empty. Lightweight interrupts enable the external
device to be quickly responded to, although the dispatcher may not complete the
processing until some time later.
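
The queuing arrangement can be pictured with a small C model. This is a sketch under stated assumptions, not the implementation described in section 4.3.12: the Bead structure, the fixed-size ring buffer, and the names add_bead, dispatch and record_service are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SLOTS 16          /* hypothetical queue depth */

/* One interrupt request descriptor (bead). */
typedef struct {
    int   source;                       /* which device raised the request */
    void (*service)(int source);        /* C level routine to run later */
} Bead;

static Bead queue[QUEUE_SLOTS];
static size_t head, tail;       /* head: next to remove, tail: next free */

/* Called from the lightweight interrupt handler: record and return. */
static bool add_bead(int source, void (*service)(int))
{
    size_t next = (tail + 1) % QUEUE_SLOTS;
    if (next == head)
        return false;           /* queue full, request not recorded */
    queue[tail].source = source;
    queue[tail].service = service;
    tail = next;
    return true;
}

/* The Dispatcher: runs until the string of beads is empty. */
static int dispatch(void)
{
    int processed = 0;
    while (head != tail) {
        Bead b = queue[head];
        head = (head + 1) % QUEUE_SLOTS;
        b.service(b.source);
        processed++;
    }
    return processed;
}

/* Hypothetical service routine that records the order of processing. */
static int serviced[QUEUE_SLOTS];
static int n_serviced;
static void record_service(int source)
{
    serviced[n_serviced++] = source;
}
```

In a real system the queue updates would have to be protected against a further interrupt arriving part way through, for example by performing them with interrupts disabled; the model omits this.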

5.7  USER  SYSTEM  CALLS 

User  programs  usually  switch  stacks  when  executing  system  calls.  The  user’s 
system  stack  is  not  accessible  to  the  User  mode  program.  This  keeps  Supervisor 
information  which  appears  on  the  stack  during  system  call  execution  hidden  from  the 
user. 

Synchronous  context  switching  generally  happens  as  a  result  of  a  system  call. 
However,  system  calls  are  also  used  to  request  the  operating  system  to  obtain 
information  for  a  user  which  is  only  directly  obtainable  with  Supervisor  access 
privileges.  The  user’s  state  must  be  saved  to  the  PCB  structure  in  a  similar  way  to  a 
synchronous  context  save.  This  makes  the  global  and  special  registers  available  for 
Supervisor  mode  C  function  use.  There  is  no  need  to  save  the  register  cache  until  a 
full  context  switch  is  known  to  be  required. 


        .equ    SIG_SYNC, -1
        .equ    ENABLE, (SM | PD | PI)
        .equ    DISABLE, (ENABLE | DI | DA)

syscall:                                ;V_SYSCALL trap handler
        constn  it0, SIG_SYNC           ; assumes no
        push    pcb, it0                ; outstanding DERR
        push    pcb, gr1                ;push gr1
        push    pcb, rab                ;push rab
        pushsr  pcb, it0, pc0           ;push specials
        pushsr  pcb, it0, pc1
        sub     pcb, pcb, 3*4           ;space pc2,cha,chd
        const   it0, 0
        push    pcb, it0                ;push CHC=0
        sub     pcb, pcb, 1*4           ;space for alu
        pushsr  pcb, it0, ops           ;push OPS
        mtsrim  cps, DISABLE            ;remove freeze
        sub     pcb, pcb, 1*4           ;space for tav
        push    pcb, lr1                ;stack support
        push    pcb, rfb                ;push rfb
        push    pcb, msp                ;push M-stack pnt.
        mtsrim  cps, ENABLE             ;enable interrupts
        push    pcb, FPStat0            ;floating point
        push    pcb, FPStat1
        push    pcb, FPStat2
;Assume the same gr112-gr115 data is shared
;by all users and the supervisor, and
;therefore will not push gr112-gr115.
;
;Align pcb for system call return
        sub     pcb, pcb, (4+(124-96+1))*4

The system call code can continue to use the cache attached to the user’s system
mode register stack. To do this the current top of stack position, gr1, must be
maintained.  The  register  stack  support  registers  are  relocated  to  the  system  stack, 
maintaining  the  existing  stack  position  offset.  The  following  code  performs  this  stack 
cut-across  operation.  It  assumes  the  system  call  is  made  from  a  valid  stack  condition. 
However,  it  includes  bounds  protection  because  operating  systems  can  never 
completely  rely  on  users  always  maintaining  valid  stack  support  registers. 


        sub     gr96, rfb, gr1          ;determine rfb-gr1
        andn    gr96, gr96, 3           ;stack is double word aligned
        const   gr97, (128*4)           ;max allowed value for
        cpleu   gr97, gr96, gr97        ; rfb-gr1 is 128*4
        jmpt    gr97, $1                ;jump if normal register usage
        const   gr97, 0x1fc             ;mask for page displacement math
        const   gr96, 512               ;limit register use to max (512)
$1:
        and     gr1, gr1, gr97          ;determine gr1 displacement within
        const   gr97, SR_STACK-1024     ; 512-byte page
        consth  gr97, SR_STACK-1024
        add     gr1, gr1, gr97          ;gr1=SR_STACK-1024+displacement
        add     rfb, gr1, gr96          ;rfb=(new gr1)+
        const   gr97, (128*4)           ; min(512, rfb-gr1)
        sub     rab, rfb, gr97          ;set rab=rfb-512
        add     lr1, rfb, 0             ;ensure all User mode registers
                                        ; restored


The technique relies on keeping bits 8-2 of the stack pointer, gr1, unchanged. In
other words, the lr0 register has the same position in the cache after the memory
resident stack portion has been exchanged. This is achieved by calculating the
address displacement of gr1 within a 512-byte page. The gr1 displacement
remains the same when the memory resident portion of the register stack is
exchanged. SM_STACK and SR_STACK are the addresses of the bottom of the
per-user system memory and register stacks respectively (see Figure 5-5).


[Figure: the User mode register stack (UR_STACK) and the Supervisor mode register
stack (SR_STACK), marked with 512-byte page boundaries. The register stack support
registers rfb and gr1 are shown before the system call and after stack cut-across,
with the register cache portion of the register stack at the same page displacement.]

Figure 5-5. Register Stack Cut-Across
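
The arithmetic of the cut-across can be followed in a short C model. This is a sketch under assumptions: the RegStack structure and the function cut_across are hypothetical, registers are modelled as 32-bit unsigned values, and sr_stack stands for the SR_STACK address of the per-user Supervisor register stack.

```c
#include <assert.h>
#include <stdint.h>

#define WS 512u                 /* 128 registers * 4 bytes */

/* Hypothetical model of the register stack support registers. */
typedef struct {
    uint32_t gr1, rab, rfb, lr1;
} RegStack;

/* Relocate the support registers onto the stack whose bottom is sr_stack,
 * keeping the displacement of gr1 within its 512-byte page. */
static void cut_across(RegStack *s, uint32_t sr_stack)
{
    uint32_t in_use = (s->rfb - s->gr1) & ~3u;  /* sub, then andn ...,3 */
    if (in_use > WS)
        in_use = WS;                            /* bounds protection */

    uint32_t disp = s->gr1 & 0x1fcu;            /* bits 8-2 of gr1 */

    s->gr1 = (sr_stack - 1024u) + disp;
    s->rfb = s->gr1 + in_use;                   /* min(512, rfb-gr1) above gr1 */
    s->rab = s->rfb - WS;
    s->lr1 = s->rfb;
}
```

For example, with gr1 = 0x80001f40, rfb = 0x80002000 and a hypothetical SR_STACK of 0x00100000, the model yields gr1 = 0x000ffd40 and rfb = 0x000ffe00: the displacement 0x140 and the 0xc0 bytes in use are both preserved.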

Once  stack  cut-across  has  been  completed,  a  call  to  the  C  level  system  call 
handler  can  be  issued.  The  C  code  may  get  its  incoming  parameters  from  the  register 
stack,  or  the  system  call  trap  handler  code  may  copy  the  parameters  from  the  local 
registers  to  memory  locations  accessible  by  the  Supervisor  mode  C  handler. 

;copy lr2,... arguments to memory locations
        add     gr96, tav, 0            ;save service numb.
        sub     gr1, gr1, 4*4           ;new stack frame
        asgeu   V_SPILL, gr1, rab
        add     lr1, gr1, 4*2           ;ensure lr1 restore
        const32 lr0, _syscall           ;C handler
        calli   lr0, lr0                ; call
        add     lr2, gr96, 0            ;pass service numb.

The C system call handler may place its return values in known memory
locations, rather than global registers gr96-gr111. If this is the case, then the values
must be copied to the normal return registers. System calls indicate their
successful or unsuccessful completion to their callers by setting register tav (gr121)
to TRUE or FALSE; the high level handler achieves this by modifying the gr121
register location in the PCB before the system call return code is executed. A FILL
assertion is used to repair the cache to the position at which the system call was
issued.


        add     gr1, gr1, 16            ;restore system
        nop                             ; call frame lr1
        asleu   V_FILL, lr1, rfb        ;restore all cache
;copy return values from memory to gr96,...
        jmp     resume                  ;restore context
        nop

Because a User mode signal handler may use the system call mechanism to issue
a sigreturn, it is possible an asynchronous context restore may be required in place of
the normal synchronous context restore associated with a system call. Label resume
is jumped to and is described in the Restoring Context section (section 5.10). If an
asynchronous  context  is  being  restored,  then  a  pointer  to  the  context  being  restored 
will  have  been  passed  to  the  sigreturn  system  call.  The  high  level  C  handler  will  have 
copied  this  data  over  the  PCB  data  stored  at  the  time  of  the  system  call  trap  entry.  The 
C  handler  must  change  the  SIG_SYNC  value  stored  in  the  PCB  by  the  system  call 
trap  handler.  This  will  cause  the  resume  code  to  perform  an  asynchronous  rather  than 
synchronous  context  restore. 

5.8  FLOATING-POINT  ISSUES 

The example code presented saves only three supervisor accessible global
registers under the heading floating-point support. These registers are typically
ks13-ks15. This is sufficient to save and restore floating-point context when an
Am29000 processor is being used with trapware emulation, provided that
interrupts are turned off during floating-point trapware execution. If floating-point
trapware is interruptible, then the Am29000 trapware support registers (typically
it0-it3 and kt0-kt11) would have to be saved.

When  an  Am29027  floating-point  coprocessor  is  used,  either  inline  or  via 
trapware support, the complete state of the coprocessor must be saved. This requires
an additional 35 words of space beyond the three Am29000 global support
registers.

Some  real-time  operating  systems  may  run  floating-point  trapware  with 
interrupts off and choose to save no floating-point support registers at all. This will improve
context  switch  times.  User  programs  typically  only  change  the  rounding  mode 
information  in  the  support  registers.  If  all  user  tasks  run  with  the  same  rounding 
information,  then  there  is  no  need  to  save  and  restore  the  three  floating-point  support 
registers. 

The Am29050 directly executes floating-point instructions without the need for
trapware. It has four floating-point support registers, special registers sr160-sr162 and
sr164. In fact, the three support registers required by the Am29000 are used to
virtualize these Am29050 registers. Saving Am29050 floating-point context would
be achieved by saving these four registers and the four double word accumulator
values. However, the Am29050 does not directly support integer DIVIDE and
DIVIDU instructions. The trapware which implements these instructions requires
six support registers (typically kt0-kt5). If this trapware is interruptible, then these
registers would also have to be saved.


5.9  DEBUGGER  ISSUES 

Debuggers  such  as  AMD’s  MiniMON29K  monitor  have  a  special  context 
switch  requirement.  They  need  to  be  able  to  switch  context  to  the  debugger  from  a 
running  application  or  operating  system  without  losing  the  contents  of  any  processor 
register.  One  possibility  is  to  reserve  a  global  register  in  the  range  gr64-gr95, 
specifically  for  debugger  support.  But,  most  operating  system  developers  are 
unwilling  to  give  up  a  register. 

A  technique  which  avoids  losing  a  register  for  operating  system  use  is  to  use  gr4 
to  first  store  a  single  operating  system  register,  and  then  use  this  register  to  start 
saving  the  rest  of  the  processor  context.  The  Am29000  does  not  have  a  gr4  register 
but  the  ALU  forwarding  logic  enables  this  technique  to  work.  The  code  example 
below, taken from MiniMON29K, shows how the processor context save gets
started. Note, _dbg_glob_reg is the memory address used by the debugger to save
global registers.

        .macro  const32, reg, data
        const   reg, data               ;zero high, set low
        consth  reg, data               ;high 16-bits
        .endm

dbg_V_bkpt:
        const32 gr4, _dbg_glob_reg+96*4
        store   0, 0, gr96, gr4         ;save gr96
        const32 gr96, _dbg_glob_reg+97*4
        store   0, 0, gr97, gr96        ;save gr97
        add     gr96, gr96, 4
        store   0, 0, gr98, gr96        ;save gr98
;
        call    gr96, store_state
        const   gr97, V_BKPT

Label  dbg_V_bkpt  is  the  address  vectored  to  by  an  illegal  opcode 
(MiniMON29K  uses  these  to  implement  breakpoints  on  the  Am29000).  When 
function  store_state  is  reached,  global  registers  gr96-gr98  have  already  been  saved. 

The gr4 user should be careful to remember that the Am29000 ALU forwarding
logic only keeps the gr4 register value alive for one cycle following its modification.
Additionally, because emulators also make use of gr4 in analyzing processor
registers, it is not possible to use an emulator to debug the monitor entry code shown
above.


5.10  RESTORING  CONTEXT 

The  supervisor  register  pcb  must  point  to  the  top  of  the  process  control  block 
stack  describing  the  previously  saved  context.  A  test  of  the  signal  number  data 
located  at  the  bottom  of  the  PCB  stack  enables  us  to  determine  if  the  stack  was  saved 
synchronously  or  asynchronously.  Restoring  synchronously  saved  tasks  can  be 
achieved  more  quickly  because  there  is  less  relevant  data  in  the  PCB  stack. 


resume:
        add     kt0, pcb, CTX_SIGNUMB
        load    0, 0, kt0, kt0          ;sync/async save ?
        jmpt    kt0, sync_resume
        nop

Asynchronously  saved  states  have  a  greater  number  of  global  registers  to  be 
restored.  There  are  also  additional  special  register  values. 

async_resume:
        mtsrim  cr, 29-1                ;restore gr96-124
        sub     pcb, pcb, 29*4
        loadm   0, 0, gr96, pcb
        popsr   q, it0, pcb             ;restore specials
        popsr   ipc, it0, pcb
        popsr   ipb, it0, pcb
        popsr   ipa, it0, pcb
        jmp     fp_resume
        nop


Now  that  the  context  information,  unique  to  an  asynchronously  saved  state,  has 
been  restored,  the  context  which  is  common  between  asynchronous  and 
synchronous  save  states  can  be  restored  via  a  jump  to  fp_resume. 


sync_resume:
        add     pcb, pcb, 16*4          ; space for gr96-111
        mtsrim  cr, 4-1                 ; optional restore of gr112-115
        loadm   0, 0, gr112, pcb
        add     pcb, pcb, 4*4
        add     pcb, pcb, 9*4           ; space for gr116-124
        add     pcb, pcb, 4*4           ; space for IPA-Q
fp_resume:
        pop     FPStat2, pcb            ; floating point
        pop     FPStat1, pcb
        pop     FPStat0, pcb

Now  that  most  of  the  global  and  User  mode  accessible  special  registers  have 
been  restored,  it  is  time  to  restore  the  register  cache.  In  the  case  where  they  were 
saved  due  to  an  asynchronous  event,  this  requires  care.  First  the  register  stack  support 
registers  must  be  restored. 


        .equ    DISABLE, (SM | PD | PI | DI | DA)

        mtsrim  cps, DISABLE
        pop     msp, pcb                ; M-stack support
        pop     rfb, pcb                ; R-stack support
        pop     lr1, pcb
        add     kt1, pcb, 9*4
        pop     rab, kt1
        pop     gr1, kt1
        add     gr1, gr1, 0             ; alu operation

By examining the register stack support pointers it is possible to determine if
the process state was stored during a SPILL interrupt. In this case the saved
gr1 will be more than a window distance below rfb; this means [gr1]-[rfb]
should not be restored. In the case of restoring an interrupted far-longjmp,
the cache need not be restored.


        .equ    WS, 512                 ; Window Size

                                        ; If User mode uses virtual addressing,
                                        ; restore PID field in MMU register
                                        ; to PID of incoming task.

        sub     kt0, rfb, rab           ; window size
        srl     kt0, kt0, 2             ; convert to words
        cpleu   kt1, kt0, WS>>2         ; test for normal
        jmpt    kt1, rnorm              ;  or FILL interrupt
        cpgtu   kt1, gr1, rfb           ; test for far-
        jmpt    kt1, rlljmp             ;  longjmp interrupt
        nop

rspill:                                 ; restore interrupted spill
        const   kt0, WS
        sub     kt1, rfb, kt0           ; determine rab
        add     kt0, gr1, 0             ; save interrupted gr1
        add     gr1, kt1, 0             ; set gr1=rfb-(window size)
        mtsrim  CR, (512>>2)-1
        mtsrim  cps, ENABLE&~PD         ; virtual data
        loadm   0, UA, lr0, kt1         ; load all of cache
        mtsrim  cps, ENABLE             ; physical data
        jmp     rlljmp
        add     gr1, kt0, 0             ; restore interrupted gr1

When synchronously saved tasks are restored, or asynchronously saved tasks
which were interrupted during either a normal register stack condition or an
interrupted FILL, local registers [gr1]-[rfb] are restored to the cache.


rnorm:
        sub     kt0, rfb, gr1           ; determine number of bytes
        srl     kt0, kt0, 2             ; adjust to words
        sub     kt0, kt0, 1
        mtsr    CR, kt0
        mtsrim  cps, ENABLE&~PD         ; virtual data
        loadm   0, UA, lr0, gr1         ; restore R-stack cache
        mtsrim  cps, ENABLE             ; physical data
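
The three-way decision made by the restore code can be modelled in a few lines. The following Python sketch is only an illustration of the tests described in the text; the function name restore_plan and its return format are hypothetical, and the addresses used to exercise it are made-up values rather than anything taken from a real system.

```python
WS = 512  # register window size in bytes, as in the .equ in the listing


def restore_plan(gr1, rfb, rab, ws=WS):
    """Return (path, load_address, word_count) for a saved register stack.

    Models the comparisons only: it does not touch any registers.
    """
    mask = 0xFFFFFFFF                      # 32-bit unsigned arithmetic
    window_words = ((rfb - rab) & mask) >> 2
    if window_words <= (ws >> 2):
        # Normal condition or interrupted FILL: reload [gr1]..[rfb].
        return ("rnorm", gr1, ((rfb - gr1) & mask) >> 2)
    if gr1 > rfb:
        # Interrupted far-longjmp: the cache need not be restored.
        return ("rlljmp", None, 0)
    # Interrupted SPILL: reload the whole window below rfb.
    return ("rspill", (rfb - ws) & mask, ws >> 2)
```

For a normal stack with rfb-rab equal to the window size, the sketch selects the rnorm path and reports the number of words between gr1 and rfb; a larger rfb-rab distance selects rspill or rlljmp depending on where gr1 lies.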

Now that the local registers have been restored, all that remains to do is
restore the remaining special registers. This requires applying Freeze mode
with interrupts disabled during this critical stage.


rlljmp:
        pop     tav, pcb
        mtsrim  cps, FREEZE
        popsr   ops, it0, pcb           ; frozen specials
        popsr   alu, it0, pcb
        popsr   chc, it0, pcb
        popsr   chd, it0, pcb
        popsr   cha, it0, pcb
        popsr   pc2, it0, pcb
        popsr   pc1, it0, pcb
        popsr   pc0, it0, pcb
        iret
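
The order in which words are popped implies a layout for the process control block. The Python sketch below reconstructs that layout from the pop sequence alone, taking the offsets used on the sync_resume path as the reference. It is an inference for illustration, not the definition of the CTX_ constants (which are defined elsewhere in the chapter); the names POP_ORDER and pcb_offsets are hypothetical.

```python
# Words in the order the restore code consumes them, lowest address first.
POP_ORDER = (
    [f"gr{n}" for n in range(96, 125)]          # 29 global registers
    + ["q", "ipc", "ipb", "ipa"]                # User mode specials
    + ["FPStat2", "FPStat1", "FPStat0"]         # floating point status
    + ["msp", "rfb", "lr1"]                     # stack support
    + ["tav", "ops", "alu", "chc", "chd",       # tav and frozen specials
       "cha", "pc2", "pc1", "pc0"]
    + ["rab", "gr1"]                            # reached early via kt1
)


def pcb_offsets(order=POP_ORDER, word_bytes=4):
    """Map each saved item to its byte offset from the top of the PCB stack."""
    return {name: index * word_bytes for index, name in enumerate(order)}
```

With this layout, rab sits nine words above tav, which matches the "add kt1, pcb, 9*4" used to reach rab and gr1 before the frozen special registers are popped.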


5.11  INTERRUPT  LATENCY 

Interrupt latency is an important issue for many real-time applications. It is
defined here as the time which elapses between identifying the interrupting
device's request and performing the necessary processing to remove the request.
Latency is
increased  by  having  interrupts  disabled  for  long  periods  of  time.  Unfortunately  it  is 
desirable  to  have  operating  system  code  perform  context  switching  with  interrupts 
disabled. 

Consider  the  case  where  a  User  mode  process  is  interrupted  and  a  signal  is  to  be 
sent  to  the  process.  The  operating  system  starts  saving  the  interrupted  process  context 
on  the  user’s  memory  stack.  However,  in  the  process  of  doing  this  an  interrupt  is 
generated  by  a  peripheral  device  requiring  Supervisor  mode  C  level  interrupt  handler 
support.  This  second  interrupt  requires  a  context  switch  to  the  Supervisor  mode 
interrupt  stack.  In  the  process  of  preparing  the  processor  to  run  the  C  level  handler, 
the  context  switch  code  may  become  confused  about  the  state  of  the  stack  support 
registers  as  a  result  of  partial  changes  made  by  the  interrupted  signal  handler 
operating  system  code.  Additionally,  there  is  likely  to  be  register  usage  conflict 
between  the  different  operating  system  code  support  routines. 

The status confusion and register conflict are avoided by disabling interrupts
during the critical portions of the operating system code. The code shown in
this chapter enables interrupts after the frozen special registers and stack
support registers have been saved. This is insufficient to deal with the nested
interrupt situation described above. However, this does reduce interrupt
latency, which is a concern to real-time 29K users. Some implementors may
choose to move the enabling of interrupts to a later stage in the operating
system support code; more specifically, to a point after register stack support
registers have been assigned their new values. Register usage changes will also
be required to avoid conflict.

Within the example code used throughout this chapter, interrupts can be enabled
just after special register CHC has been saved (before lr1 is pushed on the
PCB). This low latency technique enables lightweight interrupt handlers to be
supported during the operation of normally critical operating system code.
Lightweight handlers typically only run in Freeze mode and can easily avoid
register conflict if they are restricted to global registers it0-it3. Using the
Interrupt Queuing Model described in section 4.3.12, or the Signal Dispatcher
described in section 2.5.6, a lightweight handler responds to the peripheral
device interrupt. It transfers any critical peripheral device data and clears
the interrupt request. In doing so, it inserts an interrupt-descriptor, or
signal number, into a queue for later processing.

A Supervisor C level interrupt handler known as the Dispatcher removes queue
entries and calls the appropriate handler to process them. If the operating
system is interrupted in a non-critical region by a device requiring a
Supervisor mode C level handler, then the Dispatcher is immediately started. If
the interrupt is in a critical region then the Dispatcher is started later,
when the current critical tasks have been completed. If the Dispatcher is
already running when the interrupt occurs, then the associated interrupt
descriptor waits in the queue until the Dispatcher removes it for processing.
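
The queuing behaviour described above can be sketched as a small model. This is a Python illustration of the control flow only, under the assumption that a single flag marks the critical region; the class InterruptQueue and its method names are hypothetical and do not correspond to any routine in the operating system code shown in this chapter.

```python
from collections import deque


class InterruptQueue:
    """Model of a lightweight handler feeding a Dispatcher through a queue."""

    def __init__(self, handlers):
        self.handlers = handlers      # signal number -> callable(queue)
        self.queue = deque()
        self.in_critical = False      # True while critical OS code runs
        self.dispatching = False      # True while the Dispatcher runs
        self.log = []                 # order in which signals were serviced

    def lightweight_handler(self, signum):
        """Freeze mode stage: queue the request, start Dispatcher if allowed."""
        self.queue.append(signum)
        if not self.in_critical and not self.dispatching:
            self.dispatch()

    def leave_critical(self):
        """End of the critical region: run any deferred work."""
        self.in_critical = False
        if self.queue and not self.dispatching:
            self.dispatch()

    def dispatch(self):
        """Dispatcher: remove queue entries and call the C level handlers."""
        self.dispatching = True
        while self.queue:
            signum = self.queue.popleft()
            self.log.append(signum)
            self.handlers[signum](self)
        self.dispatching = False
```

An interrupt arriving inside a critical region is only queued; it is serviced when leave_critical is called. An interrupt arriving while the Dispatcher is running waits in the queue and is picked up by the same Dispatcher loop.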

The use of a Dispatcher and interrupt queuing helps to reduce interrupt latency
via the use of lightweight interrupts when building queue entries. However, the
method has some restrictions. It works where troublesome nested interrupt
servicing can be partially delayed for later high level handler completion. But
some interrupts cannot be delayed. For example, an operating system may be
running with address translation turned on, and a TLB miss may occur for an
operating system memory page which needs the support of a high level handler to
page-in the data from a secondary disk device. In this case the interrupt must
be completely serviced immediately. This is not a typical environment for 29K
users in real-time applications. And even in many non-real-time operating
system cases the operating system runs in physical mode or all instructions and
data are known to be currently in physical memory. The trade-offs required in
deciding when to enable interrupts and resolving register conflict are specific
to each operating system implementation.


5.12  ON-CHIP  CACHE  SUPPORT

First  level  caches  are  small  on-chip  memories  which  can  respond  on  behalf  of 
off-chip  memory  when  a  processor  attempts  a  memory  access.  When  the  required 
access  is  satisfied  by  the  cache,  known  as  a  cache  hit,  a  performance  advantage  is 
obtained  when  compared  to  accessing  slower  off-chip  memory.  Caches  enable  high 
performance  systems  to  be  constructed  without  the  expense  and  complexity  of  fast 
system  memory. 

The 29K family supports a mixture of different cache schemes (see Table 5-1).
Some of the inexpensive devices such as the Am29005 processor and the Am29200
microcontroller have no on-chip cache. Other family members generally have some
kind of instruction memory cache; and in some of the top performing processors,
data cache is provided. The individual processor User's Manual describes the
operation of the available cache in detail. Chapter 1 outlined the basic cache
capabilities of the family (see sections 1.3-1.9). This section deals with the
support code needed to maintain cache operation. Some cache operations are
described in more detail for the purpose of showing how cache maintenance
affects system performance.

When  a  cache  is  provided,  the  29K  family  supports  two-way  set  associative 
caching.  The  two-way  cache  associativity  (see  section  6.2)  provides  two  possible 
locations  (blocks  or  cache  entries)  for  caching  any  selected  memory  location.  A 
block  contains  four  contiguous  words  from  memory  and  associated  tag  and  status 
bit-fields. When a cache miss occurs, and both associated blocks are valid but
not locked (can be displaced), a block is chosen at random for replacement
(known as reload). Investigations have shown that random replacement can be
more successful than a Least Recently Used (LRU) replacement scheme.
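
A two-way set associative lookup with random replacement can be illustrated with a short model. The Python sketch below is a simplification written for this discussion: it tracks only tags, ignores locking and the status bit-fields, and assumes that an invalid block in the set is filled before a valid block is displaced (the text does not state this). The class name TwoWayCache is hypothetical.

```python
import random

BLOCK_WORDS = 4  # a block holds four contiguous words


class TwoWayCache:
    """Tag-only model of a two-way set associative cache."""

    def __init__(self, num_sets, seed=0):
        self.num_sets = num_sets
        # Two columns per set; None marks an invalid block.
        self.tags = [[None, None] for _ in range(num_sets)]
        self.rng = random.Random(seed)

    def _split(self, word_address):
        block = word_address // BLOCK_WORDS
        return block % self.num_sets, block // self.num_sets

    def access(self, word_address):
        """Return True on a hit; on a miss reload the block and return False."""
        index, tag = self._split(word_address)
        ways = self.tags[index]
        if tag in ways:
            return True
        if None in ways:
            victim = ways.index(None)        # assumption: fill invalid first
        else:
            victim = self.rng.randrange(2)   # random replacement
        ways[victim] = tag
        return False
```

Two addresses that select the same set can be cached together, one in each column. A third address selecting that set displaces one of them, chosen at random.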

When  a  29K  processor  is  reset,  the  processor  disables  all  caches  by  setting  the 
cache  disable  bit-fields  in  the  CFG  configuration  register.  Cache  entries  must  first  be 
invalidated  before  the  cache  is  enabled.  Supervisor  mode  code  can  perform  most 
operating  system  cache  maintenance  services  by  simply  manipulating  the  bit-fields 
of  the  CFG  register.  In  addition  Supervisor  mode  privileged  instructions  are  provided 
for  cache  invalidation. 

5.13  INSTRUCTION  CACHE  MAINTENANCE 

Instruction cache memory typically has a larger impact on performance than
data cache with the 29K family. This is due to the reduced number of data accesses
required  by  application  code.  The  reduction  is  relative  to  other  processors,  generally 
CISC,  which  have  a  small  number  of  on-chip  registers.  Application  data  is  normally 
held  in  the  128-word  register  file  which  is  a  cache  of  the  top  of  the  application 
register  stack. 

The potentially higher performance of a RISC chip is only achieved if the
instruction pipeline is kept effectively busy. The RISC engine is instruction
hungry and to prevent stalling it must be kept fed with instructions from cache
memory or a high bandwidth off-chip memory system (see section 1.10). On-chip
caches can supply instruction sequences at a rate of one per cycle without any
initial access penalties. Thus they can keep the pipeline fed without any
stalling due to lack of available instructions to process.


Table 5-1.  29K Family Instruction and Data Cache Support

                              I-cache               D-cache
  3-bus Microprocessors
      Am29005                 -                     -
      Am29000                 BTC 32x16             -
      Am29050                 BTC 64x16 or 128x8    -
  2-bus Microprocessors
      Am29035                 4k                    -
      Am29030                 8k                    -
      Am29040                 8k                    4k
  Microcontrollers
      Am29205                 -                     -
      Am29200                 -                     -
      Am29245                 4k                    -
      Am29240                 4k                    2k
      Am29243                 4k                    2k

  All cache sizes in bytes

The original 3-bus family members have a Branch Target cache due to the
improved access to off-chip memory made possible with three busses. Later 2-bus
and microcontroller family members have a more conventional, bandwidth
improving, instruction cache. It is interesting to consider the benefits of an
instruction cache when the memory system is able to support single cycle memory
access. For example, the built-in DRAM controller used in the Am29240
microcontroller is able to support single cycle burst-mode access. An
instruction cache cannot improve on the 1-cycle memory access. However, the
cache still hides the initial access penalties incurred when starting a new
burst sequence. It also enables parallel LOAD and STORE instruction execution,
the processor pipeline being supplied by the instruction cache while the data
bus is free to perform a data access (see section 1.7.2).

The  required  cache  maintenance  software  does  not  present  much  of  an 
overhead.  Because  the  address  in  the  program  counter  is  presented  to  the  instruction 
cache at the same time it is presented to the MMU, the instruction cache does not
operate  with  physical  addresses  if  the  MMU  is  in  use.  Thus,  the  29K  family 
instruction  caches  operate  with  virtual  addresses  when  testing  for  a  cache  hit. 
Because  cache  entries  are  not  tagged  with  a  per-process  identifier  the  cache  must  be 
flushed  when  a  process  (or  task)  context  switch  occurs.  This  is  to  prevent  a  previous 
process’s  virtual  address  appearing  to  match  with  the  current  task’s  virtual  address. 
Only  systems  which  operate  with  multiple  tasks  using  virtual  addressing  must 
invalidate  the  cache  when  a  user-task  context  switch  occurs.  Using  the  IRETINV 
(interrupt  return  and  invalidate)  instruction  is  one  convenient  way  of  doing  this. 
However,  if  the  processor  runs  tasks  with  physical  addressing,  there  is  no  need  to 
flush  the  cache  on  a  process  (task)  context  switch.  With  physical  addressing,  each 
task  is  restricted  to  execution  within  a  limited  and  possibly  unique  range  of  the 
available  address  space. 
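
The need to flush a virtually tagged cache on a context switch can be shown with a toy model. In the Python sketch below a cache entry is keyed by the virtual address and the User/Supervisor mode only, since there is no per-process identifier in the tag. The class VirtualICache and the dictionaries standing in for each task's address space are hypothetical, made up for this illustration.

```python
class VirtualICache:
    """Instruction cache keyed by (virtual address, supervisor flag) only."""

    def __init__(self):
        self.entries = {}

    def fetch(self, vaddr, supervisor, address_space):
        """Return the cached instruction, reloading from memory on a miss."""
        key = (vaddr, supervisor)
        if key not in self.entries:
            self.entries[key] = address_space[vaddr]
        return self.entries[key]

    def invalidate(self):
        """Model of an INV or IRETINV: every entry becomes invalid."""
        self.entries.clear()


# Two tasks use the same virtual address for different code.
task_a = {0x1000: "add"}
task_b = {0x1000: "sub"}

cache = VirtualICache()
first = cache.fetch(0x1000, False, task_a)      # miss, loads task A's code
stale = cache.fetch(0x1000, False, task_b)      # false hit on A's entry
cache.invalidate()                              # flush at the context switch
fresh = cache.fetch(0x1000, False, task_b)      # miss, loads task B's code
```

Without the invalidation, the second task is handed the first task's instruction. A Supervisor mode fetch of the same virtual address uses a different key and so is not confused with the User mode entry.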

The  instruction  cache  is  enabled  by  clearing  the  Instruction  Cache  Disable  (ID) 
bit  of  the  CFG  configuration  register  (the  CD  bit  is  used  with  3-bus  processors). 
Cache  entries  are  built  around  blocks  of  four  consecutive  instructions.  Each  block 
has some associated tag and status information. This information, shown in
Figure 5-6, is the same for each processor. However, the exact layout of the bit-fields
may  vary  among  family  members. 


        | Address Tag | V | P | US |

Figure 5-6.  Instruction Cache Tag and Status bits


The Valid (V) bit-field indicates if the cache entry is valid. For processors which
have  a  1-bit  field,  setting  this  bit  means  all  four  instructions  are  valid  cache  entries. 
When  a  family  member  supports  a  4-bit  field,  a  separate  bit  is  used  to  indicate  a  valid 
entry  for  each  of  the  four  cached  instructions. 

Each  block  has  a  P  bit-field.  This  bit  indicates  that  the  tagged  address  relates  to  a 
physical  address  value.  The  P  bit  becomes  set  when  the  cache  is  reloaded  while  the  PI 
(Physical  Instruction)  bit  in  the  CPS  register  is  set.  This  allows  cache  entries  to  hold 
interrupt  handlers  which  typically  run  with  physical  addressing.  The  interrupt 
handler  code  can  be  distinguished  from  User  mode  and  Supervisor  mode  virtually 
addressed  code. 

When the cache is invalidated using an INV type instruction all valid bits are
reset, even entries which were valid and had their P bit set. In some cases
there may be a performance gain to be had by not invalidating physical cache
entries but only virtual address entries. However, the performance gain is
small and the on-chip silicon overhead for this feature would be relatively
high.

The US bit-field of each cache block tag indicates if the address relates to User
mode  or  Supervisor  mode  code.  The  US  bit  becomes  set  when  the  cache  is  reloaded 
while  the  SM  bit  is  set  in  the  CPS  register.  This  allows  cache  entries  to  be  used  for 
both  User  mode  and  Supervisor  mode  code  at  the  same  time,  and  entries  can  remain 
valid  during  application  system  calls  and  system  interrupt  handlers  which  execute  in 
Supervisor  mode. 

Following  sections  present  further  detail  about  instruction  caching  for 
individual  29K  family  members.  Table  5-2  summarizes  this  information. 


Table 5-2.  Instruction Cache Comparison

  Processor                   Am29000,      Am29030            Am29240,
                              Am29050                          Am29040

  Addressing                  Virtual       Virtual            Virtual
  Cache associativity         2-way set     2-way set          2-way set
  Valid bits per block        4 bit         1 bit              4 bit
  Per-process identifiers     No            No                 No
  Replacement selection       Random        Random             Random
  Direct cache access         No            via CIR and CDR    via CIR and CDR
  Reload blocking             No            Yes                No
  Target word first reload    Yes           No                 Yes
  Cache locking               No            Per-column         Per-column


5.13.1  Cache  Locking  and  Invalidating 

Cache  locking  is  an  issue  when  addressing  techniques  other  than  physical  are 
used  by  an  application  or  operating  system.  There  is  often  an  expressed  desire  to  lock 
critical  data  into  the  cache  and  prevent  its  displacement  when  User  mode  address 
translation  changes.  The  objective  is  to  improve  performance  by  out-smarting  the 
random  replacement  algorithm  used  for  cache  reload.  In  practice  this  objective  is 
difficult  to  achieve.  If  code  is  frequently  executed,  and  thus  critical  to  overall 
performance,  it  will  naturally  be  placed  in  the  cache.  The  random  replacement 
technique is effective at finding the critical code. It would be difficult and
possibly overambitious to expect that a programmer, unless supported with
sophisticated tools, could achieve a better result.

The cache can be invalidated in a single cycle using an INV or IRETINV type
instruction. However, this invalidates all User and Supervisor mode entries. It
might be possible to improve the execution speeds of Supervisor mode code and
interrupt handlers by keeping them locked in the cache. This may also reduce
interrupt latency times but is no doubt at the cost of reduced User mode code
execution speed. The non-BTC processors, that is, the 2-bus processors and
microcontrollers, provide a means of locking the cache.

Locking valid blocks (or entries) into the cache is not provided for on a
per-block basis but in terms of the complete cache or a single column. When a
column is locked, valid blocks are not replaced; invalid blocks will be
replaced and marked valid and locked. Cache locking can be applied before
preloading the cache with instruction sequences critical to performance.
Instruction cache locking is achieved by setting the IL field of the CFG
configuration register. When the cache is locked, an INV type instruction will
not cause block invalidation unless the cache is also disabled. Column 0 and
column 1 of each set can be locked, or only column 0 locked. When only column 0
is locked, replacement of blocks in column 1 continues on a direct mapping
basis. That is, there is only one location in the cache which can cache any
particular memory address. This results in increased cache reload activity
which reduces the effectiveness of the cache.

As an illustrative exercise, consider the code necessary to invalidate only
User mode cache entries. For a 4K byte instruction cache there are 1K
instructions cached in 256 blocks of four instructions. Given the two-way set
approach, there are 128 sets, each set containing one block in each of the two
columns. The following code scans the 128 blocks of column 0, and invalidates
the block only if the entry is found to cache User mode code. Note, the cache
must be disabled while being accessed via the Cache Interface (CIR) and Cache
Data (CDR) registers. These registers enable cache tags and data to be directly
read and written.


        const    gr64, 0x100            ; set the ID-bit
        mfsr     gr65, cfg              ; read CFG register
        or       gr65, gr65, gr64       ; disable cache
        mtsr     cfg, gr65              ; write CFG config.
        const    gr64, 128-2            ; scan 128 blocks
        const32  gr65, 0x10000000       ; FSEL=01, tag read
        const32  gr67, 0x01000000       ; R/W OR mask
        const    gr68, 0                ; zero value
next:
        mtsr     cir, gr65              ; prepare to read tag
        mfsr     gr66, cdr              ; read tag-status word
        sll      gr66, gr66, 31         ; test US-bit
        jmpt     gr66, keep             ; jump if Super. mode
        or       gr66, gr65, gr67       ; set the RW-bit to write
        mtsr     cir, gr66              ; prepare to write tag
        mtsr     cdr, gr68              ; write zero into status
keep:
        jmpfdec  gr64, next             ; test if all blocks tested
        add      gr65, gr65, 1*16       ; point to next block
        const    gr64, 0x100            ; set the ID-bit
        mfsr     gr65, cfg              ; read CFG register
        andn     gr65, gr65, gr64       ; clear ID-bit to enable cache
        mtsr     cfg, gr65              ; write CFG register
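
The effect of the scan loop can be expressed as a short model. The Python sketch below operates on a plain list standing in for the tag-status words of column 0; it assumes the US bit occupies bit 0 of the word read through CDR, as implied by the shift of 31 places before the jmpt test. The exact bit layout varies among family members, and the function name is hypothetical.

```python
US_BIT = 0x1  # assumed position of the US bit in the tag-status word


def invalidate_user_entries(tag_words):
    """Zero every tag-status word whose US bit is clear (User mode code).

    tag_words is modified in place; the number of words written is returned.
    """
    written = 0
    for index, word in enumerate(tag_words):
        if word & US_BIT:
            continue              # Supervisor mode entry: keep it
        tag_words[index] = 0      # write zero into the status
        written += 1
    return written
```

As in the assembly loop, a word that is already zero is written again rather than skipped, so the count returned includes blocks that were already invalid.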

With a 2/1 memory system, testing and invalidating each block takes 10 cycles
(2/1 refers to the memory system access times: 2-cycle first, 1-cycle for
subsequent). This amounts to 1280 cycles for all blocks in column 0, or
51.2 microseconds for a 25 MHz processor. Actual use of the example code
presents a considerable overhead and is unlikely to achieve an overall system
benefit over simply invalidating the whole cache in a single cycle.
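
The cache geometry and the scan time quoted above follow from simple arithmetic, reproduced here as a Python sketch. The function names and parameter names are illustrative; the figures (4K bytes, four instructions per block, two columns, 10 cycles per block, 25 MHz) are those given in the text.

```python
def cache_geometry(cache_bytes, bytes_per_instruction=4,
                   block_instructions=4, columns=2):
    """Return (instructions, blocks, sets) for an instruction cache."""
    instructions = cache_bytes // bytes_per_instruction
    blocks = instructions // block_instructions
    sets = blocks // columns
    return instructions, blocks, sets


def scan_time_us(blocks, cycles_per_block, clock_mhz):
    """Time in microseconds to visit the given number of blocks."""
    return blocks * cycles_per_block / clock_mhz


instructions, blocks, sets = cache_geometry(4 * 1024)
column0_time = scan_time_us(sets, 10, 25)   # one block per set in column 0
```

For the 4K byte cache this gives 1024 instructions, 256 blocks and 128 sets, and scanning the 128 blocks of column 0 takes 1280 cycles, or 51.2 microseconds at 25 MHz.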

5.13.2  Instruction  Cache  Coherence 

The 29K family does not contain unified instruction and data caches. Unified
caches can give a higher hit rate than split caches of the same total size.
However, separate instruction and data caches enable a higher performance due
to simultaneous accesses during the same processor cycle. There are fewer
problems with instruction cache coherence than data cache coherence. This is
because a memory supplying instructions is unlikely to be modified by another
processor or external DMA controller. Yet, a processor can use store
instructions to place new instructions in memory (assuming a write-through
policy described in the following Data Cache Maintenance section). When this
occurs it is possible that the affected memory may be already located in the
instruction cache. It is important that the instruction cache be invalidated
after self-modifying code has changed memory which will later be accessed for
instructions. Because cache invalidation can only be performed by Supervisor
mode code, a system call service may be required to invalidate the cache.
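
The coherence hazard created by self-modifying code can be demonstrated with a minimal model. In the Python sketch below stores update memory but never the instruction cache, so a previously cached instruction stays visible until the cache is invalidated. The class Machine and its methods are hypothetical stand-ins, not a description of any 29K mechanism beyond what the text states.

```python
class Machine:
    """Memory plus a separate instruction cache that stores do not update."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.icache = {}

    def fetch(self, address):
        """Instruction fetch: served from the cache when possible."""
        if address not in self.icache:
            self.icache[address] = self.memory[address]
        return self.icache[address]

    def store(self, address, value):
        """Data store: changes memory only."""
        self.memory[address] = value

    def invalidate_icache(self):
        """Supervisor mode service standing in for an INV instruction."""
        self.icache.clear()


m = Machine({0x2000: "nop"})
before = m.fetch(0x2000)          # caches the original instruction
m.store(0x2000, "add")            # program patches its own code
stale = m.fetch(0x2000)           # still the old instruction
m.invalidate_icache()
after = m.fetch(0x2000)           # now the patched instruction
```

The middle fetch returns the old instruction even though memory has changed, which is why an invalidation is needed after code is modified.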

The  Instruction  cache  operates  with  virtual  address  tags  when  address 
translation  is  in  use  (physical  instruction  (PI)  bit  clear  in  CPS  register).  The  cache 
tags  do  not  contain  any  per-process  identifiers,  but  can  distinguish  between  User  or 
Supervisor  mode  access.  When  address  translation  is  used,  it  is  possible  that  a  User 
mode  virtual  address  maps  to  the  same  physical  address  as  a  Supervisor  mode  virtual 
address.  However,  the  cache  would  assign  separate  blocks  to  each  of  the  virtual 
addresses.  Hence,  the  instructions  on  shared  instruction  pages  could  be  cached  twice. 
This  results  in  inefficient  use  of  the  cache  but  is  unlikely  to  lead  to  any  problems 
unless the instructions on the shared physical page are modified. Note, two
User mode processes cannot map their virtual addresses to the same physical
page, as the cache must be invalidated when a process context switch occurs.

5.13.3  Branch  Target  Cache 

The Am29000 and Am29050 3-bus processors have a Branch Target Cache (BTC) which
can supply the first four instructions of a previously taken branch.

The Am29000 processor can cache 32 branch targets. The arrangement is the usual
two sets with 16 blocks (or entries) in each set. The Am29050 processor is
configurable to cache 64 branch targets, each block containing four
instructions. Alternatively, 128 blocks, still arranged in two sets, can be
used to contain only two instructions. The smaller block size makes more
effective use of the cache when the BTC is required to hide a smaller
instruction memory access latency (see section 1.9).
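
The BTC configurations can be compared by the number of instructions each can hold. The Python sketch below is plain arithmetic on the figures given above, assuming four-byte instructions; the function name btc_capacity is illustrative.

```python
def btc_capacity(branch_targets, instructions_per_block,
                 bytes_per_instruction=4):
    """Return (instructions, bytes) held by a Branch Target Cache."""
    instructions = branch_targets * instructions_per_block
    return instructions, instructions * bytes_per_instruction


am29000 = btc_capacity(32, 4)        # 32 targets of four instructions
am29050_wide = btc_capacity(64, 4)   # 64 targets of four instructions
am29050_narrow = btc_capacity(128, 2)  # 128 targets of two instructions
```

Both Am29050 configurations hold the same number of instructions; the choice trades the number of branch targets remembered against the number of instructions supplied for each target.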

The  programmer  has  little  control  over  BTC  operation;  it  is  maintained 
internally  by  processor  hardware.  There  is  no  means  of  accessing  or  preloading  the 
cache  via  the  cache  interface  registers  provided  on  other  29K  family  members. 
Additionally,  there  are  no  cache  lock  bits  provided  for  in  the  CFG  register.  The  cache 
can  be  disabled  by  setting  the  CD  bit  in  the  CFG  register;  and  invalidated  by 
executing  an  INV  or  IRETINV  instruction. 

5.13.4  Am29030  2-bus  Microprocessor 

The Am29030 has an 8K byte instruction cache, 4K bytes being provided by each
of the two columns. The Am29035 only provides column 0 and hence has 4K bytes
of cache (this results in the Am29030 typically having a 20% performance
advantage for large programs). These processors were the first 29K family
members to have a non-BTC instruction cache. When a branch instruction is
executed and the block (cache entry) containing the target instruction sequence
is not found in the cache, the processor fetches the missing block and marks it
valid. Complete blocks are always fetched, even if the target instruction lies
at the end of the block. However, the cache forwards instructions to the
decoder without waiting for the block to be reloaded. If the cache is enabled
and the block to be replaced in the cache is invalid and locked, then the
fetched block is placed in the cache. Note, complete blocks are fetched even
when the cache is disabled. This is a little wasteful if the target of a jump
or branch is not the first address in a block.

Blocks are tagged on a per-block basis. There is only one Valid bit in the
block status information. This bit is not set until the processor has fetched
an entire block with no errors. Blocks which are fetched ahead during prefetch
buffer filling are not marked valid if execution does not continue into the
block. Filling the prefetch buffer in this way enables burst-mode access to be
maintained for longer intervals, and hence reduces overall access delays. LOAD
or STORE instructions can occur at any time; however, the Am29030 processor
completes the fetch of the current block before starting the data access. This
is because it is probably more efficient to complete the instruction fetch,
which is likely in single-cycle burst-mode. The cache reload characteristics of
the Am29030 processor (reload blocking) further emphasize the importance of
scheduling LOAD instructions ahead of the time the data is required for further
operations. The current tools for the 29K family do not support code
positioning such that the targets of call and jump instructions begin on a
block boundary. This would lead to an expansion of code space requirements and
is likely to produce little performance improvement.

5.13.5  Am29240  and  Am29040  Processors 

The  Am29240  microcontroller  has  a  4K  byte  instruction  cache.  The  Am29040 
2-bus  microprocessor  has  an  8K  byte  instruction  cache.  The  caches  are  implemented 
using  a  similar  two-way  set  associative  architecture.  The  major  difference  from  the 
earlier  Am29030  processor  cache  is  that  the  block  status  information  has  a  valid  bit 
per  instruction.  The  resulting  four  bits  enable  partially  filled  cache  blocks  to  be 
supported.  This  has  been  shown  to  produce  an  average  performance  gain  of  4%  over 
the  valid  bit  per  block  method.  However,  the  performance  difference  may  be  larger 
for  code  which  contains  an  unusually  large  number  of  branch  instructions.  Note,  the 
Am29240  microcontroller  only  caches  instructions  held  in  DRAM  or  SRAM 
address  regions. 

Because  cache  blocks  are  not  tagged  per  block,  it  is  possible  to  interrupt  cache 
reload  with  a  higher  priority  operation.  This  means  LOAD  instructions  need  not  wait 
till  the  end  of  the  current  block  reload  before  they  can  gain  access  to  the  processor 
busses.  Unlike  the  block  oriented  cache  of  the  Am29030,  cache  reload  begins  with 
the  target  instruction  of  a  branch,  not  the  first  instruction  of  the  block.  As  with  the 
Am29030,  instructions  are  forwarded  for  execution  in  parallel  with  cache  block 
reload.  During  instruction  prefetch,  the  next  block  is  fetched  ahead  if  it  is  not  already 
in  the  cache  or  if  any  of  its  valid  bits  are  clear. 
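
The difference between the two reload schemes can be sketched for a single branch miss. The Python model below assumes that a per-instruction-valid cache fetches from the target word to the end of the block and leaves the earlier words invalid; the text states only that reload begins with the target instruction and that partially filled blocks are supported, so this stopping point is an assumption. The function name is hypothetical.

```python
def reload_after_branch(target_word, per_word_valid, block_words=4):
    """Return (fetch_order, valid_bits) for one block after a branch miss.

    target_word is the position (0..block_words-1) of the branch target.
    """
    if per_word_valid:
        # Am29240/Am29040 style: target word first, partial block allowed.
        fetch_order = list(range(target_word, block_words))
    else:
        # Am29030 style: the complete block is always fetched.
        fetch_order = list(range(block_words))
    valid_bits = [word in fetch_order for word in range(block_words)]
    return fetch_order, valid_bits


partial = reload_after_branch(2, per_word_valid=True)
whole = reload_after_branch(2, per_word_valid=False)
```

For a branch to the third word of a block, the per-block scheme fetches two words that lie before the target, while the per-instruction scheme under this assumption fetches none.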

The  instruction  cache  can  be  invalidated  in  a  single  cycle  using  an  INV  or 
IRETINV  instruction.  These  instructions  also  simultaneously  invalidate  the  data 
cache.  To  invalidate  only  the  instruction  cache,  instructions  INVI  and  IRETINVI  are 
provided. 

5.14  DATA  CACHE  MAINTENANCE 

Newer  members  of  the  29K  family  can  operate  with  internal  processor  speeds 
which  are  higher  than  the  off-chip  memory  system  speeds.  This  ability  is  known  as 
Scalable  Clocking.  To  obtain  the  processing  benefits  of  the  higher  internal  pipeline 
speed,  it  becomes  important  to  prevent  pipeline  stalling  due  to  accesses  to  any 
off-chip  data  memory.  For  this  reason,  on-chip  data  cache  has  been  incorporated  into 
the  29K  family.  When  a  cache  hit  occurs,  the  accessed  data  is  supplied  by  the  cache 
rather  than  off-chip  memory.  If  the  number  of  cache  hits  can  be  kept  high,  the 
potential  pipeline  stalling  which  results  from  a  cache  miss  can  be  minimized. 

As  with  instruction  caches,  two-way  set  associative  addressing  is  used  (see 
section  6.2).  However,  unlike  instruction  caches,  29K  family  data  caches  are  always 
accessed  with  physical  rather  than  potentially  virtual  addresses.  Physically 
addressed caches have advantages over virtually addressed caches. For example,
they do not need to be invalidated on a task context switch; they do not need extra tag
information  to  distinguish  virtual  from  physical  access  and  Supervisor  from  User 
mode  access;  and  importantly,  cache  coherence  problems  are  more  easily  solved  with 
a  physically  addressed  cache.  It  is  somewhat  more  difficult  to  implement  a  physically 
addressed  data  cache.  Virtual  data  addresses  must  first  be  converted  to  physical 
addresses  before  cache  access  can  be  attempted.  The  required  address  translation 
followed  by  the  cache  access  overhead  can  introduce  a  delay  before  the  cache  can 
respond  with  the  requested  data.  As  internal  processor  speeds  increase,  the  cache 
may not be able to respond within a single cycle, thus introducing the potential for
pipeline  stalling  if  load  instruction  scheduling  is  not  performed. 

The  data  cache  is  enabled  by  clearing  the  Data  Cache  Disable  (DD)  bit  in  the 
CFG  configuration  register.  Data  caches  support  accesses  to  byte  and  half-word 
sized  objects  within  a  cached  word.  Cache  tag  information  is  associated  with  each 
block  (or  cache  entry),  and  the  block  size  is  four  words  (16  bytes).  A  2K  byte  data 
cache  would  have  64  sets,  each  containing  two  blocks  (a  total  of  128  blocks  given 
there  is  a  block  for  each  of  the  two  columns  in  a  set).  Individual  cache  entries  can  be 
accessed via the Cache Interface (CIR) and Cache Data (CDR) registers. These
registers  enable  the  data  and  tags  of  a  cache  block  to  be  directly  read  and  written. 
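The geometry quoted above can be checked with a little arithmetic. The following is a minimal sketch in C, not taken from AMD documentation. The function names are hypothetical, and the split of a physical address into block offset, set index and tag assumes the conventional arrangement in which the low-order bits select the byte within a block and the next bits select the set.

```c
#include <stdint.h>

#define BLOCK_BYTES 16u   /* four words per block (from the text)   */
#define COLUMNS      2u   /* two-way set associative (from the text) */

/* Number of sets in a cache of the given total size in bytes. */
static unsigned cache_sets(unsigned cache_bytes)
{
    return cache_bytes / (BLOCK_BYTES * COLUMNS);
}

/* Set selected by a physical address (assumed conventional bit split). */
static unsigned set_index(uint32_t paddr, unsigned cache_bytes)
{
    return (paddr / BLOCK_BYTES) % cache_sets(cache_bytes);
}

/* Remaining high-order address bits, compared with the stored tag. */
static uint32_t address_tag(uint32_t paddr, unsigned cache_bytes)
{
    return paddr / (BLOCK_BYTES * cache_sets(cache_bytes));
}
```

For a 2K byte cache, cache_sets(2048) gives 64 sets, and 64 sets of two columns give the 128 blocks mentioned in the text.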

There  is  only  one  Valid  (V)  bit  for  each  block.  This  means  blocks  are  never 
partially  filled  and  marked  valid.  A  29K  data  cache  only  allocates  cache  blocks  to 
data  when  a  miss  occurs  during  a  data  load  operation.  This  is  known  as  a 
“read-allocate”  policy.  When  performing  a  data  store  and  an  address  match  is  not 
found  in  the  cache,  no  cache  block  will  be  allocated.  This  “no  write-allocation” 
policy  has  some  advantages.  It  simplifies  the  cache  design,  as  an  “allocate  on  write” 
policy  may  require  a  currently  valid  block  to  be  written-back  to  memory  before  the 
block  is  reallocated  to  cache  the  data  block  causing  the  cache  miss.  This  would  be  a 
complicated  process  as  the  reload  and  write-back  activities  both  require  access  to  the 
system  busses.  Additionally,  the  instructions  following  the  load  instruction  may  also 
require  access  to  the  system  bus  if  they  are  not  being  provided  by  the  instruction 
cache.  To  implement  an  “allocate  on  write”  policy,  which  avoided  the  potentially 
severe  pipeline  stalling,  would  be  expensive  in  terms  of  on-chip  (silicon)  resources. 
Typically,  when  data  is  written-out  to  memory  it  is  no  longer  required,  as  compilers 
prefer  to  keep  critical  data  in  registers.  Thus,  typical  patterns  of  data  access  indicate 
that  data  written-out  should  not  cause  block  allocation  as  the  data  is  somewhat  less 
likely  to  be  accessed  again  in  the  near  future. 
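The allocation policy described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the structure and function names are hypothetical, the cache is sized at 64 sets of two columns, and the victim choice is a simple alternating counter standing in for the random replacement used by the hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SETS        64u
#define COLUMNS      2u
#define BLOCK_BYTES 16u

struct block  { bool valid; uint32_t tag; };
struct dcache { struct block b[SETS][COLUMNS]; unsigned next_victim; };

/* Return the block holding paddr, or NULL when the access misses. */
static struct block *lookup(struct dcache *c, uint32_t paddr)
{
    unsigned set = (paddr / BLOCK_BYTES) % SETS;
    uint32_t tag = paddr / (BLOCK_BYTES * SETS);

    for (unsigned col = 0; col < COLUMNS; col++) {
        struct block *blk = &c->b[set][col];
        if (blk->valid && blk->tag == tag)
            return blk;
    }
    return NULL;
}

/* Load: returns true on a hit. A miss allocates a block (read-allocate). */
static bool model_load(struct dcache *c, uint32_t paddr)
{
    unsigned set = (paddr / BLOCK_BYTES) % SETS;
    unsigned col;

    if (lookup(c, paddr) != NULL)
        return true;

    /* Assumption: an invalid column is used before a valid one is replaced. */
    if (!c->b[set][0].valid)      col = 0;
    else if (!c->b[set][1].valid) col = 1;
    else                          col = c->next_victim++ % COLUMNS;

    c->b[set][col].valid = true;
    c->b[set][col].tag   = paddr / (BLOCK_BYTES * SETS);
    return false;
}

/* Store: returns true on a hit. A miss allocates nothing. */
static bool model_store(struct dcache *c, uint32_t paddr)
{
    return lookup(c, paddr) != NULL;
}
```

In this model a store to an uncached address leaves the cache unchanged, so a following load to the same address still misses and is the access that allocates the block.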

When  stores  are  performed  on  data  which  is  not  currently  in  the  cache,  or  to  data 
which  is  supported  with  a  “write-through”  policy,  a  write-through  buffer  is  used  to 
assist  the  operation.  The  buffer  is  two  words  deep  and  holds  store-data  which  is 
waiting  for  access  to  the  memory  bus.  This  enables  the  processor  to  continue 
executing  new  instructions  and  not  wait  till  the  store  is  complete.  The  pipeline  only 
stalls when there are more than two outstanding stores waiting to be written into
memory. This seldom happens, but when it does, the write-buffer, which normally
has the lowest priority, is given a higher priority for accessing the system busses.

Because  load  instructions  have  a  bigger  impact  on  performance  than  store 
instructions,  cache  reload  may  be  performed  before  the  write-buffer  is  emptied.  The 
Am29240  has  dependency  logic  to  detect  if  a  load  is  performed  on  a  data  address 
which  is  currently  pending  in  the  write-buffer.  The  data  is  forwarded  from  the 
write-buffer  when  necessary.  Because  the  Am29040  has  a  copy-back  rather  than 
write-through  policy,  the  write-buffer  is  first  flushed  before  loads  that  miss  in  the 
cache  are  performed  —  this  is  explained  in  the  later  Am29040  Microprocessor 
section. 
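The behavior of the two-word write-through buffer, including the forwarding performed by the Am29240 dependency logic, can be sketched as a small FIFO model. The names below are hypothetical and the model assumes word-aligned addresses; it makes no claim about how the hardware is actually built.

```c
#include <stdbool.h>
#include <stdint.h>

#define WBUF_DEPTH 2u   /* two words deep (from the text) */

struct wbuf_entry { uint32_t addr; uint32_t data; };
struct wbuf       { struct wbuf_entry e[WBUF_DEPTH]; unsigned count; };

/* Queue a store. Returns false when the buffer is full, which is the
 * case where the pipeline would have to stall. */
static bool wbuf_push(struct wbuf *w, uint32_t addr, uint32_t data)
{
    if (w->count == WBUF_DEPTH)
        return false;
    w->e[w->count].addr = addr;
    w->e[w->count].data = data;
    w->count++;
    return true;
}

/* The memory bus has completed the oldest buffered store. */
static void wbuf_drain_one(struct wbuf *w)
{
    if (w->count == 0)
        return;
    for (unsigned i = 1; i < w->count; i++)
        w->e[i - 1] = w->e[i];
    w->count--;
}

/* Load check: if a store to addr is still pending, forward its data.
 * The newest matching entry is used so the load sees the latest value. */
static bool wbuf_forward(const struct wbuf *w, uint32_t addr, uint32_t *data)
{
    for (unsigned i = w->count; i-- > 0; ) {
        if (w->e[i].addr == addr) {
            *data = w->e[i].data;
            return true;
        }
    }
    return false;
}
```

A third store issued while two are pending is refused by wbuf_push, which corresponds to the stall condition described in the text.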

The  write-buffer  is  disabled  when  the  data  cache  is  disabled.  In  this  case  the 
processor  is  not  decoupled  from  the  performance  of  memory  writes.  Before  interrupt 
processing  commences  or  when  a  serializing  instruction  is  executed,  the  write  buffer 
is  flushed.  Additionally,  execution  of  LOADL  or  LOADSET  instructions  (which 
bypass the data cache) is preceded by write-buffer flushing. Store instructions are
properly  ordered,  and  since  the  STOREM  instruction  bypasses  the  write-buffer,  the 
buffer  is  emptied  before  the  STOREM  commences. 

Data  cache  reload,  resulting  from  a  load  access  which  missed,  always  fills  a 
complete  block.  The  process  of  reloading  the  cache  is  assisted  with  a  reload  buffer 
which  temporarily  holds  the  data  fetched  from  memory.  The  cache  reload  buffer  is 
four  words  deep.  When  the  buffer  is  full  it  is  transferred  into  the  cache  in  a  single 
cycle  when  the  cache  is  currently  not  being  accessed.  Code  continues  to  execute 
during  cache  reload;  and  the  cache  will  continue  to  service  cache  accesses  which  hit. 
However,  if  a  further  data  load  operation  is  performed  on  data  not  found  in  the  cache, 
the  processor  pipeline  will  stall  until  the  current  reload  operation  is  complete.  When 
the  reload  buffer  becomes  available  the  second  reload  operation  will  commence  (if 
necessary)  and  the  pipeline  will  restart  instruction  processing. 

The  following  sections  present  further  detail  about  data  caching  for  individual 
29K  family  members.  Table  5-3  summarizes  this  information. 

5.14.1  Am29240  Microcontroller 

A block diagram of the Am29240 cache architecture is shown in Figure 5-7.
The  precise  cache  implementation  may  differ  from  the  diagram  but  the  data  flow 
paths  can  be  seen. 

A  buffered  “write-through”  policy  is  implemented  for  all  data  stores.  If  write 
data  matches  with  a  cached  entry,  then  the  cache  is  updated  during  the  same  cycle  as 
the  store.  All  stores  cause  writes  to  off-chip  memory,  but  the  write-through  buffer 
enables  the  processor  to  continue  code  execution  while  the  stores  are  completed  in 
parallel. 

The cache is accessed in the execute stage of the pipeline even if address
translation is in use. This makes data that hits in the cache available for the instruction
following the load without any pipeline stalling. However, scheduling of load
instructions is still required in case of data misses, which are still subject to the
access latencies of the external memory.

Table 5-3. Data Cache Comparison

    Processor                    Am29040             Am29240, Am29243
    ---------------------------  ------------------  ----------------
    Addressing                   Physical            Physical
    Cache associativity          2-way set           2-way set
    Valid bits per block         1 bit               1 bit
    Write-through buffer         2 words             2 words
    Reload buffer                4 words             4 words
    Copy-back buffer             4 words             -
    Copy-back policy             Selectable          No
    Write-through policy         Selectable          Always
    Non cachable regions         On a per-page basis For PIA space
    Critical word first reload   No                  Yes
    Reload memory access         Burst mode          Page mode
    Bus snooping                 Yes                 No
    LOADM causes reload          No                  Yes
    Cache locking                Per-column          Per-column
    Cache block allocation       Only on LOAD        Only on LOAD[M]
    LOAD hit latency             2-cycles            1-cycle
    Replacement selection        Random              Random

Figure 5-7. Am29240 Microcontroller Cache Data Flow

Data cache reload always fills a complete block. The format of the cache tag and
status information is very simple, as shown in Figure 5-8. Reload always begins with
the "critical word first". The critical word is the word containing the requested data.
The critical word is fetched and forwarded to the appropriate execution unit and to
the cache reload buffer. Reload continues with the remaining words in the block and
if necessary wraps at the end of the block to fill the remainder of the block. To
increase the cache reload speeds, the processor attempts to use page-mode accesses
when loading from DRAM. Note, burst-mode addressing can not be used as the
block may not be accessed with consecutive addresses due to critical word first
reload.
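The wrap-around order of a critical word first reload can be written down in a few lines. This is an illustrative sketch only; the function name is hypothetical, and the word offsets are simply counted 0 to 3 within the four-word block.

```c
#include <assert.h>

#define WORDS_PER_BLOCK 4u

/* Fill order[] with the word offsets of a block in the sequence they
 * are fetched: the critical word first, then the following words,
 * wrapping at the end of the block. */
static void reload_order(unsigned critical_word,
                         unsigned order[WORDS_PER_BLOCK])
{
    assert(critical_word < WORDS_PER_BLOCK);
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
        order[i] = (critical_word + i) % WORDS_PER_BLOCK;
}
```

When the requested data lies in word 2, the fetch sequence is 2, 3, 0, 1. The addresses are not consecutive across the wrap, which is why burst-mode addressing cannot be used.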


    | Address Tag | V |

Figure 5-8. Am29240 Data Cache Tag and Status bits


The  processor  only  caches  accesses  made  to  DRAM  or  ROM  address  regions. 
The write-through policy ensures that data in external memory is always consistent
with data held in cache. Accesses to other address regions or on-chip peripherals are
not  cached.  When  polling  the  status  of  a  peripheral  device,  it  is  important  that  status 
data  not  be  cached.  This  means  that  off-chip  peripherals  should  be  placed  in  PIA 
space  or  other  non  cached  space. 

When  developing  code  in  the  C  programming  language,  the  key  word  volatile 
can  be  used  to  indicate  that  data  should  not  be  held  in  internal  registers.  However,  this 
data  may  still  be  cached.  Hence,  marking  data  volatile  is  insufficient  to  ensure  that  it 
is  always  accessed  from  off-chip  memory.  If  memory  can  be  modified  by  some  other 
device,  either  via  dual-port  memory  or  external  DMA  controller,  it  is  important  that 
the  cache  be  kept  coherent  with  memory.  This  can  be  accomplished  by  signaling  the 
processor  when  a  DMA  type  transfer  is  complete.  The  processor  can  then  invalidate 
the  cache.  Because  the  cache  normally  contains  a  copy  of  the  memory  data  (due  to  the 
write-through  policy),  all  modifications  to  cached  data  are  already  reflected  in  the 
memory state. Note that marking data volatile may reduce the compiler's ability to
produce  highly  optimized  code,  as  load  scheduling  is  restricted  across  the  boundary 
created  by  a  volatile  memory  access. 

Cache  invalidation  due  to  DMA  type  access  can  be  avoided  if  the  data 
concerned  is  never  cached.  There  is  no  way  with  the  Am29240  microcontroller  of 
marking  the  data  as  non-cacheable.  However,  data  which  is  accessed  via  LOADL 
(load  and  lock)  instructions  is  never  allocated  for  cache  use.  A  convenient  way  of 
ensuring  that  the  compiler  only  generates  code  which  accesses  the  critical  data  with 
LOADL  and  STOREL  instructions  has  been  added  to  newer  versions  of  the  High  C 
compiler.  When  the  key  word  _LOCK  is  used  (along  with  volatile)  to  define  the  data 
type  of  a  variable,  LOADL  instructions  are  used  in  place  of  LOAD  when  accessing 
the  associated  data.  Consider  the  example  below: 


    typedef _LOCK volatile unsigned char UINT_8;

    unsigned char uart_data;    /* cacheable copy of UART data */
    UINT_8 *uart_p;             /* uart_p must hold uart address */

    uart_data = *uart_p;        /* access the UART */

If  the  _LOCK  volatile  approach  is  not  available,  it  may  be  possible  to  take  an 
object-orientated  approach  to  DMA  affected  data.  The  critical  data  could  be  only 
modified  with  an  object  member  function.  The  member  function  (probably  a  leaf) 
could  be  written  in  assembler  and  use  the  LOADL  instruction.  Of  course  using  such  a 
simple  function  to  perform  a  task  which  would  normally  be  accomplished  with 
in-line  code,  would  have  a  performance  impact.  However,  this  may  be  better  than 
invalidating  the  whole  cache  with  each  DMA  occurrence.  Note  that  directly  setting 
the  Lock  (LK)  bit  in  the  Current  Processor  Status  (CPS)  register  will  ensure  that  the 
Lock  pin  is  asserted  during  load  and  store  operations,  but  does  not  result  in  data  cache 
bypassing.

When register stack filling occurs, a LOADM instruction is used to restore local
registers  which  were  previously  spilled.  The  data  loaded  during  the  filling  will  be 
allocated  in  the  cache  and  possibly  displace  other  cached  data.  However,  the  filled 
data  is  intended  for  the  register  file  only,  and  will  never  be  accessed  by  load  and  store 
instructions.  This  somewhat  reduces  the  effectiveness  of  the  cache;  but,  since 
register  stack  filling  is  a  very  infrequent  occurrence  it  is  unlikely  to  have  any  serious 
impact  on  performance. 

If  filling  always  occurred  in  Supervisor  mode,  it  would  be  very  easy  to  add  code 
to  the  fill_handler  (see  section  4.4.5)  which  disabled  the  data  cache  on  entry  and 
reenabled  the  cache  after  the  LOADM  instruction.  Valid  data  is  retained  in  the  cache 
while  it  is  disabled.  The  cache  is  disabled  and  enabled  in  Supervisor  mode  by 
respectively  setting  and  clearing  the  DD  bit  in  the  CFG  configuration  register.  This 
would  prevent  any  cached  data  being  replaced  by  the  fill  operation.  However,  filling 
is  normally  accomplished  by  trampolining  from  a  Supervisor  mode  trap  handler, 
FillTrap  (see  section  4.4.3),  to  the  User  mode  fill_handler.  This  introduces  a 
difficulty.  It  would  be  simple  to  disable  the  cache  in  the  FillTrap  code,  but  after 
returning  to  User  mode,  access  to  the  CFG  register  is  not  directly  permitted.  It  would 
be  possible  to  take  a  trap  at  the  end  of  fill_handler  to  reenter  Supervisor  mode, 
enable  the  data  cache  and  then  IRET  back,  but  it  seems  unlikely  that  the  additional 
overhead  (although  small)  would  produce  a  noticeable  performance  gain.  Another 
difficulty  with  temporarily  disabling  the  cache  is  that  an  interrupt  may  occur.  The 
interrupt  handler  or  operating  system  support  code  would  then  have  the  burden  of 
reenabling  the  cache.  However,  it  may  be  worthwhile  for  operating  system  code  to 
disable  the  data  cache  while  reloading  the  local  register  file  during  a  task  context 
restore. 

5.14.2  Am29040  2-bus  Microprocessor 

A block diagram of the Am29040 cache architecture is shown in Figure 5-9.
The  precise  silicon  implementation  may  differ  from  the  diagram  but  the  data  flow 
paths  can  be  seen. 

The  default  policy  of  the  cache  is  “copy-back”  rather  than  “write-through”. 
Stores  do  not  always  cause  writes  to  off-chip  memory,  as  is  the  case  with  a 
write-through  policy.  Consider  when  a  currently  valid  cache  block  is  to  be 
reassigned  to  a  new  memory  location.  The  write-through  policy  enables  the  block  to 
be  simply  reallocated  without  having  to  copy  its  contents  to  memory.  The  copy-back 
policy  eliminates  the  need  to  write  all  stores  to  memory,  but  requires  that  reallocated 
blocks  be  copied-back  to  memory  before  they  can  be  used  for  higher  priority  data. 

To improve the performance of the copy-back policy, the processor has a four
word copy-back buffer which is loaded in a single cycle. This makes the selected
block immediately available for reload. The copy-back buffer data is transferred to
memory when the system bus becomes available, certainly after reload is
complete. Thus, loads that miss in the cache do not need to wait for a block to be
written to memory before the data memory can be read.

Figure 5-9. Am29040 2-bus Microprocessor Cache Data Flow

The  cache  is  accessed  in  the  write-back  stage  of  the  pipeline.  Tag  comparison 
and  any  required  address  translation  is  performed  during  the  execute  stage.  This 
makes  data  that  hits  in  the  cache  available  for  the  second  instruction  following  the 
load  without  any  pipeline  stalling.  Compared  to  the  Am29240  microcontroller,  this  is 
an  extra  cycle  of  latency.  The  reason  is  the  higher  internal  clock  speeds  of  the 
Am29040 processor. Scheduling of load instructions is always required, particularly
in case of a data miss, which will stall the pipeline by an amount increasing with the
access latencies of the external memory.

Store  operations  that  hit  in  the  cache  also  require  two  cycles  to  complete.  To 
enable  the  cache  to  satisfy  a  load  instruction  which  follows  a  store,  the  second  cycle 
needed for cache access can be postponed. Stores that hit in the cache make use of the
write-through buffer just like stores that miss. The write-through buffer completes
the  second  required  cycle  of  a  store  when  the  cache  is  free. 

Because  the  write-through  buffer  can  contain  data  for  a  store  that  hit  in  the 
cache,  the  write-buffer  must  be  flushed  before  cache  reload  can  be  performed.  To 
understand  this,  consider  that  the  write-buffer  may  contain  data  for  a  modified  block 
which  must  be  written  back  before  the  block  can  be  reallocated.  The  write-buffer  can 
not  forward  the  store  data  to  the  cache  block  after  it  has  been  assigned  to  a  new 
memory  address. 

Not  all  cache  blocks  need  to  be  written  back  to  the  system  memory.  The  format 
of the cache tag and status information is shown in Figure 5-10. The tag information
contains  a  Modify  (M)  bit.  When  a  block  is  first  reloaded  the  valid  bit  is  set  and  the  M 
bit  is  cleared.  If  a  store  (which  is  not  write-through)  is  performed  to  an  address  in  the 
block,  a  hit  occurs  and  the  cache  satisfies  the  access.  At  the  same  time  the  M  bit  is  set 
indicating  the  block  has  been  modified.  If  the  block  is  reallocated,  it  will  be  copied 
back  only  if  the  M  bit  is  set.  Otherwise  the  block  can  be  reloaded  without  the 
copy-back  being  performed. 
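The role of the M bit can be summarized with a small state model. This is a sketch with hypothetical names; the Shared (S) bit is carried in the structure for completeness but its behavior is not modeled here.

```c
#include <stdbool.h>

/* Status bits held alongside the address tag of one cache block. */
struct tag_status {
    bool valid;     /* V */
    bool shared;    /* S, not modeled in this sketch */
    bool modified;  /* M */
};

/* A block has just been reloaded from memory: V set, M cleared. */
static void on_reload(struct tag_status *t)
{
    t->valid = true;
    t->modified = false;
}

/* A store hit the block. Only a store that is not write-through sets M. */
static void on_store_hit(struct tag_status *t, bool write_through)
{
    if (!write_through)
        t->modified = true;
}

/* Must the block be copied back to memory before it is reallocated? */
static bool needs_copy_back(const struct tag_status *t)
{
    return t->valid && t->modified;
}
```

A block that has only been read, or only written with write-through stores, can be reallocated immediately because memory already holds the same data.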


    | Address Tag | V | S | M |

Figure 5-10. Am29040 Data Cache Tag and Status bits


Data  cache  reload  always  fills  a  complete  block.  Unlike  the  Am29240 
microcontroller,  reload  with  critical  word  first  is  not  performed.  The  processor  will 
use  burst  mode  when  reloading  a  block  and  will  start  with  the  first  word  in  the  block. 
When  the  critical  word  is  accessed  during  reload  it  is  forwarded  to  the  execute  unit. 
This  enables  reload  to  continue  in  parallel  with  code  execution.  If  the  critical  word 
had  been  accessed  first,  and  it  was  not  the  first  word  in  the  block,  burst  mode  access  to 
the  memory  block  would  have  to  be  disrupted.  This  would  increase  the  overall  reload 
time  and  would  be  particularly  noticeable  for  back-to-back  loads  which  miss  in  the 
cache.  Data  cache  reload  is  given  priority  over  instruction  cache  for  access  to  the 
system  busses.  Loads  issued  while  the  cache  is  disabled,  or  to  noncachable  data,  only 
fetch  the  critical  word  from  memory. 

There  is  a  minimum  access  latency  of  3-cycles  for  the  first  word  in  a  reloaded 
cache  block.  This  is  true  even  if  the  off-chip  memory  system  has  the  minimum  access 
latency  of  2-cycles.  When  a  block  is  reloaded  it  is  possible  the  block  will  be  supplied 
by  another  Am29040  processor  (via  data  intervention)  rather  than  the  memory 
system.  Data  intervention  is  not  asserted  until  the  third  cycle  after  the  address  of  the 
first  word  in  the  block  appears  on  the  address  bus.  The  memory  system  may  supply 
the  data  in  two  cycles,  but  the  processor  holds  the  data  internally  for  one  cycle  in  case 
data  intervention  occurs.  Because  cache  reload  is  always  block  orientated, 
intervention only occurs with the first word of the block. If the memory system
latency is 3 cycles or more, the processor does not delay the forwarding of the first
data  word.  For  single-cycle  burst-mode  memories,  the  remaining  data  words  are  not 
delayed  internally  by  an  additional  cycle  (given  2-cycle  first  access)  unless  a 
load-multiple  is  being  performed.  Only  accessed  data,  rather  than  reloaded  data 
values,  are  sent  to  the  29K  data  channel. 

Peripheral  devices  such  as  a  UART  can  be  accessed  at  physical  memory 
locations  determined  by  specific  system  hardware.  Because  the  status  and  data  of  an 
external  device  can  change  at  any  time,  it  is  undesirable  to  cache  their  contents. 
Access  to  these  devices  is  normally  accomplished  in  Supervisor  mode.  On  entering 
Supervisor  mode  the  data  cache  could  be  disabled  by  setting  the  DD  bit  in  the  CFG 
configuration  register  (this  happens  automatically  if  the  FPD  bit  is  not  set  in  the  CFG 
register).  This  may  be  convenient  for  assembly  level  code  as  the  cache  may  only  be 
disabled  for  a  short  time.  Alternatively,  assembly  code  could  use  LOADL 
instructions  (which,  unlike  the  Am29240,  may  cause  block  allocation)  when 
accessing  peripherals.  The  LOADL  instruction  always  accesses  off-chip  memory. 
However,  if  operating  system  code  is  implemented  in  C  then  it  is  desirable,  for 
performance reasons, that the operating system data also be cached. A note of
caution: when the cache is disabled its contents are retained. Consequently, if
memory  locations,  currently  cached,  are  modified  while  the  cache  is  disabled,  the 
cache  will  supply  out-of-date  data  when  it  is  reenabled.  This  must  be  avoided. 

The  key  word  volatile  can  be  used  in  C  to  indicate  that  data  should  not  be  held  in 
internal  registers.  However,  this  data  may  still  be  cached.  Hence,  marking  data 
volatile  is  insufficient  to  ensure  that  it  is  always  accessed  from  off-chip  memory.  As 
described  in  the  previous  Am29240  section,  defining  the  data  type  to  be  _LOCK 
volatile  is  one  way  of  instructing  the  compiler  to  use  only  LOADL  instructions  when 
accessing  peripherals.  The  Am29040  has  an  alternative;  the  MMU  can  be  configured 
to  disable  caching  of  selected  memory  pages.  This  means  the  operating  system  code 
(or for that matter User mode code) must run with address translation turned on. When
TLB  reload  occurs,  the  memory  management  software  must  ensure  the  two-bit  field 
(PGM)  of  the  TLB  registers  is  set  to  “non  cachable”  for  memory  addresses 
containing mapped peripherals. The PGM field format is shown in Table 5-4.

Table 5-4. PGM Field of the Am29040 Microprocessor TLB

    PGM1   PGM0   Effect
    ----   ----   ------------------
    0      0      normal (copy-back)
    0      1      write-through
    1      0      reserved
    1      1      non cachable

Data loads performed to memory locations which are marked non cachable are
not  subject  to  data  intervention.  There  is  never  any  internal  delaying  of  data  in  case 
data  intervention  occurs  late  in  the  access.  Hence,  the  critical  word  can  be  accessed  in 
a  minimum  of  2-cycles  —  given  a  2-cycle  memory  system.  This  requires  that  a  data 
region which is not cached by a processor also not be cached by any other processor
(in  a  multiprocessor  system). 
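The PGM encodings of Table 5-4 and their effect on first-word latency can be expressed as follows. This is a minimal sketch: the enumeration and function names are hypothetical, the two PGM bits are passed as separate arguments because their position within the TLB registers is not given here, and the reserved encoding is decoded but not otherwise interpreted.

```c
/* PGM field encodings from Table 5-4, as (PGM1 << 1) | PGM0. */
enum pgm_policy {
    PGM_COPY_BACK     = 0,  /* normal */
    PGM_WRITE_THROUGH = 1,
    PGM_RESERVED      = 2,
    PGM_NON_CACHABLE  = 3
};

/* Combine the two PGM bits into a policy value. */
static enum pgm_policy pgm_decode(unsigned pgm1, unsigned pgm0)
{
    return (enum pgm_policy)(((pgm1 & 1u) << 1) | (pgm0 & 1u));
}

/* Cycles before the first word of a load is available to the processor.
 * Cached pages wait at least 3 cycles in case data intervention occurs;
 * non cachable pages see the memory latency directly. */
static unsigned first_word_latency(unsigned memory_cycles,
                                   enum pgm_policy policy)
{
    if (policy != PGM_NON_CACHABLE && memory_cycles < 3u)
        return 3u;
    return memory_cycles;
}
```

With a 2-cycle memory system, a load to a copy-back page that misses sees its first word after 3 cycles, while a load to a non cachable page sees it after 2.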

With  virtual  addressing  in  use,  the  MMU  can  be  configured  to  select  a 
write-through policy on a per-page basis. The write-through policy ensures that
data in external memory is always consistent with data held in cache as all store
instructions are applied to memory (and to the cache if they hit). Selecting this policy
for  all  memory  pages  would  result  in  a  poorer  overall  performance  compared  to  the 
default  copy-back  policy.  However,  regions  of  address  space  may  be  allocated  to 
peripherals  which  require  immediate  update.  For  example,  video  memory  should  be 
marked as write-through rather than non cachable. There is a definite advantage to
accessing  video  frame  information  from  the  cache  when  manipulating  images. 
Additionally,  system  implementations  which  fail  to  deal  with  the  additional 
hardware  signals  needed  to  support  bus  snooping  may  use  write-through  access  to 
assist with cache coherence problems. A write-through policy would only enable an
external agent to read shared data; the agent would not be able to modify the data.

To  achieve  the  best  performance,  application  code  will  likely  use  the  data  cache 
with  copy-back  operation  selected.  However,  there  are  situations  when  an 
application  will  prefer  write-through  cache  operation  —  at  least  for  portions  of  the 
memory  space.  Memory  locations  are  frequently  used  to  pass  data  between  operating 
system and application code. If the operating system were to use copy-back data
cache operation (the default), there would be a danger that some data blocks
(accessed by the operating system) would be cached and their M bit set. Later, when
returning to the application, the block may be within a memory page which is marked
write-through. This would prevent the block being copied-back should the block be
reallocated. It is best to run the operating system with address translation turned on.
This  enables  the  MMU  to  control  the  cache  operation  for  memory  pages  which  are 
jointly  accessed  by  the  operating  system  and  application  code.  To  simplify  this  task, 
the configuration register has a Freeze-PD (FPD) bit. When this bit is set the
Physical Data (PD) bit in the CPS register is not set when the operating system is
entered  via  a  trap  or  interrupt.  The  FPD  bit  enables  the  PD  bit  to  remain  unchanged. 
Thus,  if  address  translation  was  enabled  in  the  application,  it  will  remain  enabled 
after  a  trap  or  interrupt.  The  data  cache  need  not  be  disabled  when  the  operating 
system  is  entered.  However,  the  MMU  must  ensure  a  consistent  cache  policy  for 
memory  pages  which  are  jointly  accessed  by  operating  system  and  application  code. 

The  Am29040  processor  deals  with  other  agents,  such  as  a  DMA  controller, 
accessing  the  same  memory  by  performing  bus  snooping.  Multiprocessor  designs, 
with on-chip caches, are also supported by the snooping protocol. The Shared (S) bit
in the cache tag is used to support the protocol. The S bit becomes set when a write is
performed  to  an  address  which  causes  block  reload,  and  the  block  is  supplied  by 
another  cache  —  more  on  this  in  the  following  Cache  Consistency  section. 

The  Am29040  processor,  unlike  the  Am29240,  does  not  allocate  cache  blocks 
for  data  fetched  with  a  LOADM  instruction.  This  prevents  needless  displacement  of 
valid  cache  blocks  when  a  register  stack  fill  is  performed.  Block  allocation  during  a 
LOADM  in  conjunction  with  a  copy-back  policy  would  have  poor  performance 
given  that  the  copy-back  buffer  is  only  four  words  deep.  The  copy-back  buffer  and 
the  LOADM  instruction  would  both  be  competing  for  access  to  the  system  bus. 

5.14.3  Cache  Locking  and  Invalidating 

Valid  data  cache  blocks  can  be  locked  by  appropriately  setting  the  DL  field  of 
the  CFG  configuration  register.  The  entire  cache  can  be  locked  or  only  column  0.  If  a 
block  is  locked  but  still  invalid,  it  can  be  allocated  for  caching.  Critical  data  can  be 
placed  in  the  cache  by  first  locking  the  cache  and  then  loading  the  required  data.  This 
effectively  turns  the  cache  into  a  small  fast  RAM  for  critical  data.  (However,  a 
write-through  policy,  if  used,  will  still  cause  all  writes  to  be  duplicated  in  off-chip 
memory).  If  only  column  0  is  locked  the  remaining  column  1  will  still  cache  entries 
with  a  direct-mapping  replacement  scheme.  Typical  applications  show  best 
performance  when  the  cache  is  not  locked  but  left  to  the  default  scheme  of  caching 
the  most  recently  accessed  data. 
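The effect of locking on block allocation can be sketched as a selection function. The names are hypothetical, the actual encoding of the DL field is not reproduced, and the preference for an invalid column over a valid one is an assumption of this model rather than a documented rule.

```c
#include <stdbool.h>

enum lock_mode { LOCK_NONE, LOCK_COLUMN0, LOCK_ALL };

/* Choose the column of a set to allocate on a load miss.
 * Returns 0 or 1, or -1 when no block may be allocated.
 * random_bit stands in for the hardware's random replacement choice. */
static int allocation_column(bool valid0, bool valid1,
                             enum lock_mode lock, unsigned random_bit)
{
    /* A locked block that is still invalid can be allocated. */
    if (!valid0)
        return 0;
    if (!valid1)
        return 1;

    switch (lock) {
    case LOCK_ALL:     return -1;  /* every valid block is held     */
    case LOCK_COLUMN0: return 1;   /* column 1 acts as direct-mapped */
    default:           return (int)(random_bit & 1u);
    }
}
```

Locking the cache and then loading the critical data fills the invalid blocks first; once they are valid, the model refuses further allocation, which matches the "small fast RAM" usage described above.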

The  cache  can  be  invalidated  in  a  single  cycle  by  issuing  an  INV  or  IRETINV 
type  instruction.  All  blocks  are  marked  invalid  during  this  process  unless  the  cache  is 
locked.  A  locked  cache  can  only  be  marked  invalid  if  it  is  first  disabled  before 
invalidating. 

The  copy-back  policy  of  the  Am29040  makes  cache  invalidation  more  difficult. 
Valid  cache  blocks  which  have  been  modified  can  not  be  simply  marked  invalid. 
Failure  to  write-back  modified  blocks  would  leave  the  memory  in  an  out-of-date 
state.  Because  the  data  cache  operates  with  physical  address  tags  and  performs  bus 
snooping,  there  is  very  little  reason  to  invalidate  the  cache.  Cache  invalidation  can  be 
safely  performed  by  using  the  cache  interface  registers  (CIR  and  CDR)  to  examine 
each  block  to  determine  if  the  block  is  valid  and  if  the  modified  bit  (M  bit)  is  set. 
When  set,  the  block  must  be  written  out  to  memory  before  an  INV  type  instruction  is 
used. 
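
The safe invalidation sequence described above can be sketched as a small Python model. This is only an illustration under stated assumptions: the Block class, the four-word block size, and the memory dictionary are hypothetical stand-ins. On real hardware the blocks would be examined through the CIR and CDR registers, and the final step would be an INV type instruction.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One data cache block (hypothetical model, not the real tag layout)."""
    address: int = 0
    valid: bool = False
    modified: bool = False                      # the M bit
    words: list = field(default_factory=lambda: [0, 0, 0, 0])


def flush_and_invalidate(cache, memory):
    """Write back every valid, modified block, then mark all blocks invalid.

    Returns the number of blocks that had to be written back.
    """
    written = 0
    for block in cache:
        if block.valid and block.modified:
            # Copy the block out so that memory is not left out of date.
            for i, word in enumerate(block.words):
                memory[block.address + 4 * i] = word
            block.modified = False
            written += 1
    # Models the INV type instruction; safe only after the write-backs above.
    for block in cache:
        block.valid = False
    return written
```

Clean blocks are simply discarded; only blocks with the M bit set generate memory traffic.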

5.14.4  Cache  Consistency 

The  Am29040  is  currently  the  only  processor  in  the  29K  family  which  contains 
on-chip  data  cache  consistency  hardware.  Cache  consistency  becomes  an  issue 
when  there  is  more  than  one  cache  in  a  multiprocessor  system  or  when  a  DMA  type 
device  is  also  accessing  data  regions  which  are  cached.  When  there  is  more  than  one 
agent  trying  to  access  data,  it  is  important  that  all  agents  agree  upon  a  single  (and 
most recent) value. A solution to this problem is for each processor to make virtual
address access to the shared memory pages and mark the pages as non-cachable.
However, unless all memory is marked non-cachable, the plan would require that
software arrange for data intended for shared memory to appear in a range of
contiguous non-cached memory. There would need to be an agreement with the
operating  system  that  the  selected  address  range  was  not  to  be  cached.  Such  a 
mechanism  would  be  undesirable,  inflexible,  and  difficult  to  retrofit  to  existing 
software. 

With  systems  incorporating  multiple  Am29040  processors,  each  processor  may 
cache  the  same  memory  location.  This  is  desirable,  as  access  to  the  cache  is  much 
faster  than  off-chip  access.  The  processor  supports  three  interface  signal  pins  which 
facilitate  “bus  watching”  for  data  reads  with  cache  block  granularity.  The  technique 
requires  little  software  support,  and  existing  programs  can  benefit  without  any 
modifications.  The  on-chip  protocol  supporting  the  interface  signals  ensures  that 
each  memory  access  is  consistent. 

When  a  load  is  performed,  all  processors  watching  the  bus  determine  if  they 
have  a  currently  cached  copy  of  the  requested  data.  If  they  do,  they  assert  the  HIT 
signal  pin.  The  protocol  will  enable  one  cache  to  identify  itself  as  the  owner  of  the 
data.  This  cache  will  assert  both  the  HIT  and  the  DI  (“data  intervention”)  signals.  The 
processor  requesting  the  load  is  satisfied  by  the  intervening  cache.  The  load  will 
cause  a  block  to  be  allocated  with  the  S  bit  set  in  the  tag.  This  indicates  the  data  is 
shared.  The  processor  can  continue  to  access  the  data  from  the  cache.  Additionally, 
all  processors  asserting  the  HIT  signal  will  realize  that  another  processor  is  sharing 
the  data  and  will  set  the  S  bit  in  their  cached  copy.  If  any  processor  modifies  a  block 
tagged  with  the  same  address,  that  processor  will  perform  a  “write  broadcast”  as  a 
result  of  the  S  bit  being  set.  This  does  not  cause  the  system  memory  to  be  updated,  but 
enables  the  snooping  processors  to  update  their  cached  copies.  A  processor  asserts 
the  WBC  signal  pin  during  the  write  broadcast  and  becomes  the  owner  of  the  shared 
block.  The  processor  will  remain  the  owner  of  the  block  until  another  processor  gains 
ownership by performing a write broadcast itself. When a processor performs a write
broadcast it checks to see if another processor is asserting the HIT signal; if not,
the processor realizes it is now the only processor caching the data and therefore clears
the S bit.
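
The load and write broadcast steps just described can be modeled in a few lines of Python. This is a simplified sketch, not a description of the Am29040 hardware: the Cache and Bus classes are hypothetical, each block holds a single word, the M bit doubles as the ownership marker, and a store is assumed to follow an earlier load of the same address by the same processor.

```python
class Cache:
    """Toy snooping data cache holding one word per block (an assumption)."""

    def __init__(self):
        # address -> {"value": int, "S": shared bit, "M": modified/owner bit}
        self.blocks = {}


class Bus:
    """Shared bus connecting several caches to one system memory."""

    def __init__(self, memory, n_caches):
        self.memory = memory
        self.caches = [Cache() for _ in range(n_caches)]

    def _hitters(self, requester, address):
        # Every other cache holding the block asserts HIT.
        return [c for c in self.caches
                if c is not requester and address in c.blocks]

    def load(self, cpu, address):
        me = self.caches[cpu]
        if address in me.blocks:
            return me.blocks[address]["value"]      # cache hit, no bus access
        hitters = self._hitters(me, address)
        # The owner (M bit set) also asserts DI and supplies the data.
        owner = next((c for c in hitters if c.blocks[address]["M"]), None)
        value = owner.blocks[address]["value"] if owner else self.memory[address]
        for c in hitters:
            c.blocks[address]["S"] = True           # they now know it is shared
        me.blocks[address] = {"value": value, "S": bool(hitters), "M": False}
        return value

    def store(self, cpu, address, value):
        me = self.caches[cpu]
        block = me.blocks[address]                  # assumes a prior load
        block["value"] = value
        block["M"] = True                           # the writer owns the block
        if block["S"]:
            # Write broadcast (WBC asserted): system memory is NOT updated.
            hitters = self._hitters(me, address)
            for c in hitters:
                c.blocks[address]["value"] = value
                c.blocks[address]["M"] = False      # ownership has moved
            if not hitters:
                block["S"] = False                  # no HIT seen: sole copy
```

For example, after processor 0 privately modifies a word and processor 1 then loads it, the model has cache 0 intervene with the new value, both copies carry the S bit, and system memory still holds the old value.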

To  summarize,  bus  watching  of  reloads  is  used  to  detect  sharing  of  data.  When 
data  is  shared  all  caches  set  the  S  bit  in  the  cached  block.  The  processor  which 
satisfied  the  block  reload  (in  place  of  the  memory)  is  the  owner  of  the  block  and  has 
the  S  and  M  (modified)  bits  set  in  the  block  tag.  Writes  to  shared  data  create  write 
broadcasts  on  the  bus  to  inform  other  caches  of  the  change  of  value.  Ownership  of  a 
block is transferred to the processor performing the write broadcast. Cache-to-cache
communication via write broadcasts is a lot faster than accessing slower system
memory.

Bus watching monitors write-through and copy-back of cached data. Memory
regions  which  are  accessed  as  write-through  never  have  cache  blocks  which  are 
modified  (that  is,  their  tag  M  bit  is  never  set).  All  writes  to  such  regions  are  performed 
to  the  system  memory.  Caches  with  matching  blocks  will  update  their  data  when  the 
write-through  takes  place.  Only  blocks  which  have  been  modified  get  copied  back 
when  the  block  is  reallocated.  When  a  block  is  copied-back,  other  caches  will  retain 
their  clean  copies  of  the  shared  data.  There  will  now  be  no  owner  of  the  data.  If 
another  cache  performs  a  load  for  the  data,  no  processor  will  intervene  and  the  data 
block will be fetched from memory. The data consistency protocol is sometimes
referred  to  as  a  “MOESI”  protocol  (reflecting  the  five  states:  Modified,  Owned, 
Exclusive,  Shared,  and  Invalid). 

The  Am29040  processor  supports  an  optimization  for  use  with  binary 
semaphores.  They  are  frequently  used  to  enable  or  disable  access  to  shared  resources. 
A  processor  can  gain  exclusive  access  to  a  resource  via  the  LOADSET  instruction. 
The  instruction  atomically  loads  the  value  from  the  semaphore  memory  location  and 
then writes the set-value (0xffffffff) to the location. The loaded value can then be
tested;  if  it  was  already  set,  access  is  disallowed.  Access  to  a  shared  resource  is 
granted  when  a  zero  semaphore  is  read.  The  process  of  accessing  the  semaphore  with 
a  LOADSET  instruction  disables  allocation  of  the  resource  to  other  requesting 
agents.  When  acquiring  unset  semaphores,  the  processor  maintains  exclusive  control 
of  the  system  bus. 

When  access  is  not  granted,  a  processor  will,  typically,  repeatedly  access  the 
semaphore  waiting  for  it  to  become  unset.  However,  continually  polling  a  memory 
location  which  is  held  in  shared  memory  can  be  a  serious  performance  problem.  To 
prevent  the  associated  bus  activity,  the  Am29040  can  cache  binary  semaphores.  If  a 
processor busy-waits, the semaphore traffic is isolated to the processor's data cache.
Additionally,  when  a  semaphore  value  is  found  to  be  set,  further  LOADSET 
instructions  are  not  granted  access  to  the  external  bus  until  the  semaphore  is  cleared. 
The processor knows the semaphore is set by testing bit 31 of the cached value; in
such a case there is no need to perform the SET portion of the LOADSET as the
semaphore  is  already  set.  The  processor  currently  holding  access  to  the  semaphore 
will  perform  a  write  broadcast  when  it  releases  the  shared  semaphore.  A  STOREL 
atomic  instruction  is  used  to  clear  the  semaphore  value.  The  STOREL  instruction, 
like  a  LOADSET  to  an  unset  semaphore,  has  exclusive  control  of  the  bus  during  its 
execution.  The  mechanism  ensures  that  at  any  time  only  one  processor  can  gain 
access  to  a  shared  resource. 
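
The effect of caching a binary semaphore on bus traffic can be illustrated with a small Python model. It is a sketch only: SemaphoreModel and its methods are hypothetical names, loadset() and storel() stand in for the LOADSET and STOREL instructions, and the model assumes that every bus access to the semaphore is seen by all processors holding a cached copy.

```python
SET_VALUE = 0xFFFFFFFF          # the value written by the SET half of LOADSET


class SemaphoreModel:
    """One binary semaphore shared by several processors with data caches."""

    def __init__(self, n_cpus):
        self.memory = 0                     # 0 means the resource is free
        self.cached = [None] * n_cpus       # per-processor copy, None = not cached
        self.bus_accesses = 0

    def loadset(self, cpu):
        """Return the old semaphore value; 0 means access was granted."""
        cached = self.cached[cpu]
        if cached is not None and cached & 0x80000000:
            # Bit 31 already set in the cached copy: busy-wait inside the
            # cache, with no external bus access and no SET portion.
            return cached
        self.bus_accesses += 1              # atomic load-then-set, bus held
        old = self.memory
        self.memory = SET_VALUE
        self.cached[cpu] = SET_VALUE
        # Other caches holding a copy observe the new value.
        self.cached = [SET_VALUE if c is not None else None for c in self.cached]
        return old

    def storel(self, cpu):
        """Release the semaphore; cached copies are updated by the broadcast."""
        self.bus_accesses += 1
        self.memory = 0
        self.cached = [0 if c is not None else None for c in self.cached]
```

In this model a processor that spins one hundred times on a held semaphore generates a single bus access, the first one; the remaining polls are satisfied by its own cache until the holder releases the semaphore.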

5.15  SELECTING  AN  OPERATING  SYSTEM 

I  am  often  asked  by  engineers  about  to  start  a  29K  project,  what  they  should 
look  for  when  selecting  an  operating  system.  There  are  a  number  of  companies 
offering  operating  systems  with  a  range  of  different  capabilities;  alternatively  a 
home-grown system could be constructed. The material covered in this chapter and
others should help in either constructing or selecting a suitable operating system. I
would certainly advise seriously considering purchasing rather than constructing.
The task may be enjoyable but is probably more lengthy than most project timetables
will  allow.  However,  for  those  who  insist  on  building  their  own  operating  system, 
AMD  has  a  collection  of  useful  routines  which  make  a  good  starting  point.  Contact 
AMD  29K  customer  support  for  a  copy  of  the  code. 

There  is  usually  no  one  right  operating  system.  The  choice  depends  on  a  number 
of  criteria  which  may  vary  from  project  to  project.  The  following  list  presents  several 
questions  which  you  need  to  ask  yourself  and  possibly  operating  system  vendors. 
You can decide the importance of each item with regard to your project requirements.

■  Are  3-bus  family  members  as  well  as  2-bus  members  supported?  If  the 
Am29000  or  Am29050  processors  are  to  be  used,  and  the  data  bus  and 
instruction  bus  are  not  to  be  tied  together,  then  the  operating  system  must  be 
clear  about  maintaining  code  and  data  in  separate  regions.  The  Harvard 
architecture,  supported  by  3-bus  memory  systems,  typically  achieves  a  20% 
performance  gain  over  2-bus  memory  systems.  Additionally,  when  3-bus 
systems  are  supported,  the  operating  system  may  require  the  support  of  a 
hardware  bridge  allowing  the  instruction  memory  to  be  reached  (usually  with 
access  delays)  via  a  data  memory  access. 

■  Are  interruptible  SPILL  and  FILL  code  supported?  By  running  them  with 
interrupts  disabled  the  difficulties  of  performing  repair  of  the  register  stack 
support  registers  can  be  avoided,  should  they  be  interrupted.  However,  they 
require  the  support  of  multi-cycle  LOADM  and  STOREM  instructions,  which 
results  in  increased  interrupt  latency.  Additionally,  SPILL  and  FILL  support 
with  interrupts  disabled,  results  in  a  larger  overhead  compared  with 
trampolining  to  support  routines;  thus  it  is  non-optimal  as  SPILLing  and 
FILLing occur a lot more often than their interruption.

■  Given  that  SPILL  and  FILL  are  interruptible,  their  operation  is  interdependent 
with the longjmp() library routine and the signal trampoline code. All four of
these  services  must  coordinate  their  manipulation  of  register  stack  support 
registers  if  interrupts  are  to  be  reliably  supported. 

■  Some  operating  systems  support  nested  interrupts,  others  do  not;  without  nested 
interrupt  support,  interrupt  latency  can  be  increased.  The  use  of  kernel  threads 
to  complete  interrupt  processing  is  one  way  to  keep  down  latency.  If  interrupt 
handlers  are  to  be  written  in  a  high  level  language  such  as  C,  it  may  be  desirable 
to  support  Freeze  mode  handlers  in  C.  This  greatly  reduces  the  interrupt  support 
overhead,  because  the  overhead  of  preparing  the  register  stack  for  use  by 
non-leaf  procedures  is  relatively  high.  Does  the  operating  system  under 
consideration use interrupt tagwords to support interrupt context caching for
Freeze  mode  handlers? 

■  An interrupt can be configured to generate a task context switch, the new task
being  responsible  for  completing  interrupt  processing.  This  method  has  a 
greater  overhead  associated  with  it  than  processing  the  interrupt  in  the  context 
of  the  interrupted  task.  Task  context  switching  requires  the  register  cache  to  be 
flushed  and  reloaded  with  the  incoming  task’s  register  data.  A  C-level  interrupt 
handler  can  use  the  stack  cut-across  technique  to  avoid  flushing  the  register 
cache.  Certainly  some  interrupts  must  cause  task  context  switching  to  occur,  but 
it  is  best  to  avoid  this  approach  as  a  general  mechanism  for  dealing  with 
interrupts.  Additionally,  if  tasks  run  in  User  mode,  the  instruction  cache  must  be 
flushed  on  a  task  context  switch.  It  is  best  to  reduce  the  number  of  cache  flushes 
due  to  interrupt  support. 

■  If the system is to support a high interrupt throughput, then processing interrupts
with  a  Dispatcher  will  be  more  efficient.  The  Dispatcher  can  execute  in 
assembly  level  or  C  level.  If  C,  then  the  interrupted  register  stack  condition  need 
only  be  repaired  once  before  entering  the  Dispatcher,  rather  than  for  every 
interrupt  (see  section  2.5.6). 

■  Interrupt latency can be reduced if Freeze mode interrupt processing is never
disabled. For a HIF conforming operating system, the technique was described
in section 2.5.7 (Minimizing Interrupt Latency).

■  Synchronous context switching times are greatly improved by only restoring
the  activation  record  of  the  procedure  about  to  start  execution.  This  can  only  be 
done  for  tasks  which  were  synchronously  switched  out;  but  is  a  better  method 
than  restoring  the  register  stack  to  the  exact  position  in  use  at  the  time  of  the  task 
context  save. 

■  Many embedded operating systems run tasks in Supervisor mode rather than
User mode. This gives each task direct access to critical resources; there is no
need to use system calls (which use a trap instruction to enter Supervisor mode)
to  gain  access  to  restricted  resources.  Always  running  in  Supervisor  mode  has 
the  additional  advantage  that  the  Instruction  cache  need  not  be  flushed  on  a  task 
context  switch.  However,  the  benefits  of  memory  access  protection  are  typically 
reduced  or  unavailable  with  such  systems. 

■  Operating systems each have their own system call interface which is usually a
little different from HIF (see Appendix C). However, it may still be useful to
have HIF services available. The HIF services can often be supported by
translating them into the underlying operating system services. The High C 29K
and GNU library services generate HIF service calls. These libraries can be used
with a non-HIF operating system; but care must be taken as library routines
such as printf() are not reentrant. The OS-boot operating system, most often
used  with  HIF  conforming  library  services,  does  not  support  task  switching,  but 
other  operating  systems  will,  and  the  reentrant  library  procedure  limitations  will 
become  a  problem. 

■  Not all members of the 29K family support floating-point instructions directly
in  hardware.  It  is  the  operating  system’s  responsibility  to  ensure  that  the  desired 
floating-point  emulation  routines  (trapware)  are  installed.  The  operating 
system  vendor  should  also  supply  the  appropriate  transcendental  library 
services  (sin(),  cos(),  etc.)  for  the  chosen  processor. 

■  Floating-point instruction emulation is typically configured to operate with
interrupts  not  enabled.  This  avoids  the  need  to  save  interrupted  floating-point 
context.  However,  the  addition  of  floating-point  environment  saving  during 
application  context  switching  is  a  requirement  for  some  systems  and  an 
unwanted  burden  for  other  systems.  It  is  worth  knowing  the  options  an 
operating  system  supports  in  this  area. 

■  It is often desirable and less expensive to purchase an operating system in
linkable or binary form, rather than source. This makes it more difficult to make
changes to the operating system code, which can be required to incorporate
support for specialized peripheral devices. It is best that the operating system
not consume all of the 32 global registers assigned for operating system use
(gr64-gr95). Additionally, linkable operating system images can use link-time
register  assignment  rather  than  compile  time.  This  enables  the  user  to  rearrange 
the  global  register  usage  and  utilize  unassigned  registers  for  peripheral  support 
tasks. 

■  The 29K family has no hierarchical memory management unit policy built into
the  hardware.  Support  of  the  translation  look-aside  buffers  is  left  to  software. 
This  offers  great  flexibility,  but  generates  questions  about  the  MMU  support 
policy  adopted  by  the  operating  system.  Even  if  address  translation  is  not 
supported  by  an  operating  system,  it  is  still  desirable  to  use  the  MMU  hardware 
(where  available)  to  support  address  access  protection  with  one-to-one  address 
translation. 

■  There is a movement in the operating system business, which includes real-time
variants,  to  support  POSIX  conforming  system  calls.  It  may  be  worth  knowing 
how,  and  to  what  extent,  the  operating  system  vendor  plans  to  support  POSIX. 

■  Support for debugging operating system activity and application code is very
important. Often operating systems have weaknesses in this area. The
Universal Debug Interface (UDI) has been influential in the 29K debug tool
business. It offers flexibility in debug tool configuration and
selection. Debug tools are generally more available for DOS and UNIX based
cross development environments.


5.16  SUMMARY 

Typical  RISC  processors,  including  the  29K,  require  more  complex  system 
software.  The  manageability  of  such  software  development  is  very  much  a  function 
of  the  particular  RISC  processor  implementation.  Increased  knowledge  of  how  the 
compiler  utilizes  the  processor  registers  is  required  to  achieve  best  performance.  The 
availability  of  a  large  number  of  internal  registers  leads  to  improved  operation 
speeds;  although  the  performance  gains  are  at  the  cost  of  a  somewhat  more  complex 
application  task  context  switch. 

The  use  of  interrupt  processing  via  lightweight  interrupts  and  signal  handling 
methods,  along  with  the  relative  infrequency  of  context  switching,  enable  the  system 
designer  to  implement  a  supervisor  of  generally  much  improved  performance, 
vis-a-vis  CISC  processors.  Fortunately,  application  developers  can  make  use  of 
RISC  technology  without  having  to  solve  the  supervisor  design  problems 
themselves,  as  there  are  a  number  of  operating  system  products  available. 



Evaluating  and  Programming  the  29K  RISC  Family 


Chapter  6 


Memory  Management  Unit 


Address values generated by the program counter and data load and store operations
appear on the Am29000 processor address bus. Certain members of the 29K
family contain instruction caches, which eliminate the need for the processor to request
instructions from external memory when the required instruction can be obtained
from the cache. However, unless the Memory Management Unit (MMU) is in
operation, address values will flow directly on to the pins assigned to the address bus.

The MMU enables address values to be translated, to some extent, into a different
physical address. This means that the address values generated by a program need
not directly correspond to the physical address values which appear on the chip's address
pins. The program generates virtual addresses for data and instructions which
are located in physical memory at addresses determined by the MMU address
translation hardware.

With the Am29000 processor, virtual address space is broken into pages of 1K
byte, 2K byte, 4K byte or 8K byte size. The first page begins at address 0 and subsequent
pages are aligned to page boundaries. The MMU does not modify the lower
address  bits  used  to  address  data  within  a  page.  For  example,  with  a  4K  page  size,  the 
lower  12  address  bits  are  never  modified.  However,  the  MMU  translates  the  upper  20 
virtual  address  bits  into  a  new  20-bit  value.  The  translated  upper  20-bits  and  the 
original  lower  12-bits  are  combined  to  produce  a  32-bit  physical  address  value. 
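
The way the translated page number and the untouched page offset are combined can be shown with a short Python sketch. The function names are illustrative only, and the physical page number is simply passed in as an argument; how the TLB supplies it is not modeled here.

```python
def virtual_page(virtual_address, page_size=4096):
    """Return the upper address bits that the MMU translates."""
    offset_bits = page_size.bit_length() - 1        # 12 for a 4K byte page
    return virtual_address >> offset_bits


def translate(virtual_address, physical_page, page_size=4096):
    """Combine a translated page number with the unmodified page offset."""
    offset_bits = page_size.bit_length() - 1
    offset = virtual_address & (page_size - 1)      # lower bits pass through
    return ((physical_page << offset_bits) | offset) & 0xFFFFFFFF
```

With 4K byte pages, virtual address 0x12345678 lies in virtual page 0x12345; if that page is mapped to physical page 0xABCDE, the physical address is 0xABCDE678.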

The  use  of  an  MMU  enables  a  program  to  appear  to  have  memory  located  over 
the  complete  32-bit  virtual  address  space  (4G  bytes).  The  physical  memory  system 
is,  of  course,  much  smaller.  Virtually  addressed  pages  are  mapped  (via  address 
translation)  into  physical  pages  located  in  the  available  memory,  typically  1M  to  4M 
bytes. A secondary memory is used to store virtually addressed pages which are not currently
located in the physical memory due to its limited size.

The secondary memory is typically a disk. When the MMU identifies the program's
need to access data stored on a page currently out on disk, it must instruct the
operating system to page-in the required page into the physical memory. The page
may be located almost anywhere in physical memory, but the address translation capability
of the MMU will make the page appear at the desired virtual address accessed
by the program. In the process of paging-in from disk, the operating system
may  have  to  page-out  to  disk  a  page  currently  located  in  physical  memory.  In  this 
way  memory  space  is  made  available  for  the  in-coming  page. 

Within the 29K family, the MMU is located on-chip, and is constructed using
Translation Look-Aside Buffers (TLBs). This chapter describes in detail how the
TLB hardware operates, and how it can be used to implement a virtual address capability.
The TLBs provide other functions in addition to address translation, such as
separate access permissions for data read, write and instruction execution. These important
functions will be explained and highlighted in example code.

6.1  SRAM  VERSUS  DRAM  PERFORMANCE 

As  already  stated,  secondary  memory  is  typically  disk.  However,  it  is  difficult  to 
show  example  code  relying  on  disk  controller  operation.  The  example  code  would  be 
too  large  and  too  much  time  would  be  spent  dealing  with  disk  controller  operation. 
This is not our intention. I have chosen to use SRAM devices for physical memory,
with DRAM and EPROM devices playing the role of secondary
memory.

SRAM devices are much faster than most DRAM memory system arrangements.
Thus, by paging the program into SRAM, a very desirable speed gain should
be  obtained.  Certainly  the  secondary  memory  capacity  is  limited  to  the  typically  1M 
to  4M  bytes  made  available  by  the  DRAM  and  EPROM  combination.  But  programs 
will  execute  from  SRAM  alone,  which  may  be  limited  to  as  little  as  128K  bytes.  For 
large  programs  this  is  likely  to  result  in  SRAM  pages  being  paged  out  to  secondary 
DRAM  to  make  space  available  for  incoming  pages. 
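
The paging behavior described above can be sketched as a small Python model of the SRAM acting as a page cache for secondary memory. The class name and the least-recently-used replacement policy are my own assumptions for illustration; the text does not prescribe a replacement policy, and the model counts every eviction as a page-out, whereas a real implementation would only need to copy out pages that had been modified.

```python
from collections import OrderedDict


class SoftwarePageCache:
    """SRAM page frames caching pages held in secondary DRAM or EPROM."""

    def __init__(self, sram_bytes=128 * 1024, page_size=4096):
        self.frames = sram_bytes // page_size   # number of SRAM page frames
        self.resident = OrderedDict()           # virtual page -> SRAM frame
        self.page_ins = 0
        self.page_outs = 0

    def access(self, virtual_page):
        """Return the SRAM frame for a page, paging it in if necessary."""
        if virtual_page in self.resident:
            self.resident.move_to_end(virtual_page)     # most recently used
            return self.resident[virtual_page]
        if len(self.resident) == self.frames:
            # SRAM is full: page out the least recently used page
            # and reuse its frame (assumed policy).
            _, frame = self.resident.popitem(last=False)
            self.page_outs += 1
        else:
            frame = len(self.resident)                  # next unused frame
        self.resident[virtual_page] = frame
        self.page_ins += 1
        return frame
```

A 128K byte SRAM with 4K byte pages gives 32 frames; a program whose working set exceeds 32 pages will start to generate page-outs.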

The  SRAM  will  effectively  be  a  memory  cache  for  the  secondary  DRAM;  the 
Am29000  processor  MMU  being  used  to  implement  a  software  controlled  cache 
mechanism.  The  performance  difference  shown  by  programs  executing  from  SRAM 
versus DRAM is large. Figure 6-1 shows the average cycles required per instruction
execution for four well-known UNIX utility programs. The influence of memory performance
on these benchmarks is likely to be similar to that experienced by large embedded
application programs. The DRAM memory system used is termed 4-1. This
terminology is used throughout this chapter. In this case it means the memory system
requires  four  cycles  for  a  random  access  and  one  cycle  for  a  burst-mode  access. 

Burst-mode  enables  multiple  instructions  and  data  to  be  accessed  consecutively 
after  a  single  start  address  has  been  supplied.  The  first  data  or  instruction  word  in  the 
burst suffers the access penalties of a random access, but subsequent accesses are
much less expensive in terms of access delay cycles. The external memory system is
responsible  for  generating  access  addresses  after  the  processor  has  supplied  the  start 
address  for  the  burst.  This  can  be  simply  achieved  with  an  address  latch  and  counter. 
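
The cost of a burst under the 4-1 style terminology can be worked out with a one-line formula, shown here as a Python sketch. The function name is illustrative, and the model assumes an uninterrupted burst.

```python
def burst_cycles(words, first_access=4, burst_access=1):
    """Cycles to transfer `words` consecutive words in one burst.

    The first word pays the random access cost; each subsequent word
    pays only the burst-mode cost.
    """
    if words <= 0:
        return 0
    return first_access + (words - 1) * burst_access
```

On a 4-1 memory system an eight-word burst costs 4 + 7 = 11 cycles, against 18 cycles on a 4-2 system and 9 cycles on a 2-1 system.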

Figure 6-1. Average Cycles per Instruction Using DRAM
(Bar chart: average cycles per instruction for compress, diff, nroff, and assembler on a 4-1 DRAM memory system, comparing joint I/D and separate I/D.)

The Am29000 processor can execute a new instruction every cycle if supported
by the memory system. Figure 6-1 shows that the desired 1 cycle per instruction is far
from achievable by the utility programs using a 4-1 memory system. Certain members
of the 29K family (the Am29000 and the Am29050 processors) support a 3-bus
architecture. One bus is used for physical address values, and there are separate
busses for instruction and data information. This bus structure allows simultaneous
instruction and data transfer. Once the address bus has been used to supply the start
address of an instruction burst, the address bus is free for use in random or burst-mode
data accesses. Figure 6-1 shows performance values for both separate (separate
I/D), and joint instruction and data (joint I/D) busses. It can be clearly seen that
separate busses offer a significant performance gain. Figure 6-2 shows the average
cycles per instruction for the same four benchmarks executing on a 2-1 memory system.

Implementing a 2-1 memory system at 25 MHz processor speeds, in particular
obtaining a 2-cycle first access, requires SRAM memory devices. The results in
Figure 6-2 show that 1 cycle per instruction is almost achieved when a separate
instruction and data bus is used with 2-1 memory.

Figure 6-2. Average Cycles per Instruction Using SRAM

29K family members supporting separate busses do not have any means within
the chip of reading data which is located in instruction memory. If instructions and
data are to be located in the same memory pages, then an off-chip bridge must be
constructed between the data and instruction busses. Accessing data located in
the instruction memory system via the bridge connected to the data bus will require
more access cycles than accessing data located in the data memory system connected
to the data bus directly. The bridge could support accessing instructions located in
data memory, but the performance penalties seem too great to implement. The bridge
mechanism is acceptable if used for the occasional read of data located in EPROM
attached to the instruction bus. It can also be used for reading, as data, an instruction
which has caused an execute exception violation.

The construction of two memory systems, one for data and a second for instructions,
is undesirable, but it does allow a performance gain. This chapter shall deal
with an example system with a joint I/D memory system, because this simplifies the
code example. A separate I/D memory system would require separate instruction and data
memory caches and associated support data structures. A block diagram of the example
system is shown in Figure 6-3.

Even with a joint I/D memory system it may still be necessary to build two
memory systems to achieve a low number of cycles per instruction. This is because it
is difficult to achieve single cycle burst-mode access with current memory devices at
25 MHz processor rates. Two memory systems are required and are used alternately.
This technique is often called memory system interleaving. One memory system
supplies words lying on even word boundaries and the second memory system supplies
words lying on odd word boundaries. In this way each memory system has twice
as many cycles to respond to consecutive memory accesses compared to a single
memory system acting alone.

Figure 6-3. Block Diagram of Example Joint I/D System


Interleaving cannot guarantee a faster random or burst-mode first access, because
the first access cannot be overlapped with another access in the way achievable
by consecutive burst-mode accesses. However, some implementations may
achieve some savings if the first access happens to fall to the memory system which
did not provide the previous access.
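
The even and odd word split used by two-way interleaving can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch assumes byte addresses and 32-bit (4 byte) words.

```python
def bank_for(address):
    """Return 0 for the even-word memory system, 1 for the odd-word one."""
    return (address >> 2) & 1       # bit 2 of a byte address selects the bank


def banks_for_burst(start_address, words):
    """List the bank that supplies each word of a consecutive burst."""
    return [bank_for(start_address + 4 * i) for i in range(words)]
```

A burst of consecutive words alternates strictly between the two banks, which is why each bank has two cycles to respond while the processor still receives one word per cycle.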

With joint I/D systems, 4 cycle first access is very punishing on performance.
This is because instruction bursts must be suspended when a data access occurs. To
start a data access costs 4 cycles. After it has completed, the joint I/D bus can restart
the instruction burst at a cost of 4 cycles. Thus accessing a single data word will effectively
cost 8 cycles. The 4 cycle memory response latency is hidden by the branch
target cache (BTC) for branches and calls, but not for interruption of contiguous instruction
execution. Separate I/D systems do not suffer to the same extent from memory
latency effects, as the instruction bus can continue to supply instructions in parallel
with the data bus operation. Members of the 29K family, such as the Am29030 processor,
which only support joint I/D systems, have instruction cache memory on-chip
rather than BTC memory. This will enable the effects of instruction stream interruption
to be better hidden, as the on-chip cache can be used to restart the instruction
stream after data access has occurred.
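
The 8 cycle figure can be folded into a crude cycle estimate for a joint I/D system, sketched below in Python. The function is hypothetical and deliberately pessimistic: it assumes one uninterrupted instruction burst, single-word data accesses, no overlap of any kind, and no help from a BTC or instruction cache.

```python
def joint_bus_cycles(instructions, data_accesses, first_access=4, burst_access=1):
    """Rough cycle count for an instruction stream on a joint I/D bus."""
    if instructions <= 0:
        return 0
    # One burst supplies all instructions if nothing interrupts it.
    fetch = first_access + (instructions - 1) * burst_access
    # Each data access costs one first access for the data word
    # plus one first access to restart the instruction burst.
    penalty = data_accesses * (first_access + first_access)
    return fetch + penalty
```

Under these assumptions, 100 instructions containing 10 data accesses take 103 + 80 = 183 cycles on a 4-1 memory system, which illustrates why the first access time dominates joint I/D performance.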

Figure 6-4 shows average cycles per instruction for the four benchmark programs
running on various joint I/D memory systems. The 4-2 DRAM system does
not support single cycle burst-mode (2-cycle burst), and the performance reduction
from a 4-1 DRAM system is apparent. The MMU and associated software will be
used in the example system to construct a software controlled cache. The TLB support
software is based on an Am29000 TLB register format. Members of the 29K family
supporting two TLBs will require some small changes to the example code. The secondary
memory shall be a 4-2 or 4-1 DRAM memory system. Programs shall be
paged into a small 2-1 SRAM memory. If the paging activity can be kept to a minimum,
it is possible that the effective average cycles per instruction will approach that
of SRAM acting alone.

Figure 6-4. Average Cycles per Instruction

Current costs for memory devices are about $5 for 256Kx4 DRAMs and $10.50 for 32Kx8
SRAMs. At these prices 1M byte of DRAM would cost $40 and 1M byte of SRAM $336. Prices
will of course continue to fall on a per-byte basis. However, a large difference
between SRAM and DRAM prices will remain, and SRAM memory system costs will remain an
obstacle to obtaining the highest system performance. A 128K byte SRAM memory cache
would cost $42. Using such a cache in conjunction with a secondary DRAM memory is a
cost effective way of achieving high performance. Because the Am29000 processor
implements TLBs and lightweight interrupts (see section 4.3.3) on-chip, it is an ideal
processor on which to implement a software cache mechanism.
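
The prices quoted above follow from simple device counting. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and prices are held in cents to avoid floating point.

```c
#include <assert.h>

/* Devices needed to build total_bytes of memory from parts organised as
 * words x width_bits, rounded up to a whole device. */
unsigned long devices_needed(unsigned long total_bytes,
                             unsigned long words, unsigned long width_bits)
{
    unsigned long device_bits = words * width_bits;
    unsigned long total_bits = total_bytes * 8UL;

    assert(device_bits > 0UL);
    return (total_bits + device_bits - 1UL) / device_bits;
}

/* Total cost in cents, given the price of one device in cents. */
unsigned long memory_cost_cents(unsigned long total_bytes,
                                unsigned long words, unsigned long width_bits,
                                unsigned long cents_per_device)
{
    return devices_needed(total_bytes, words, width_bits) * cents_per_device;
}
```

With the figures from the text, 1M byte built from 256Kx4 DRAMs needs 8 devices, 1M byte built from 32Kx8 SRAMs needs 32 devices, and a 128K byte SRAM cache needs 4 devices.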

6.2  TRANSLATION  LOOK-ASIDE  BUFFER  (TLB)  OPERATION 

The Am29000 processor has a number of special purpose support registers accessible
only by the processor operating in Supervisor mode. Special register 2, known as the
Current Processor Status (CPS) register, has two bits which are used to enable or
disable MMU operation. Bit PI, if set, disables the MMU for all instruction accesses.
Bit PD, if set, disables the MMU for all data accesses. When both bits are set, program
address values flow directly to the address unit unmodified. This is simply known as
physical addressing.

By clearing both bits PI and PD, program instruction address values and data address
values are presented to the MMU for translation and other checking. The Am29000
generates addresses early. This means addresses are presented to the MMU during
instruction execution. The MMU completes the translation during the execution cycle,
making the translated address available at the start of the next processor cycle. The
MMU does not need to check every address value. All data access (LOAD and STORE)
instruction addresses are translated. For instruction accesses, translation is
performed only for JMP and CALL type instructions and whenever the current execution
address crosses a page boundary. Figure 6-5 shows the probability of an instruction
requiring an address translation for the four utility programs previously studied.
Typically about 30% of instructions require an address translation.


Figure 6-5. Probability of a TLB Access per Instruction (joint I/D 2-1 SRAM memory
system)


The MMU is constructed using a 64-entry Translation Look-Aside Buffer (TLB). Let’s
first deal with how the TLB registers are configured, and how address translation is
performed. Later, the additional functions supported by the TLB registers will be
studied. TLB registers are arranged in pairs, and each pair forms a single TLB entry.

The Am29000 processor can support 1K, 2K, 4K, and 8K byte page sizes. Special register
13, the Memory Management Unit configuration register (MMU register), has a two-bit
field (PS) which is used to select the page size. For the following discussion let’s
assume the PS bits are set to give a page size of 4K bytes.

The lower 12 address bits (bits 0 to 11) are not modified by the MMU translation; they
flow directly to the address pins. The next five address bits (bits 12 to 16) are used
to select a TLB set. See Figure 6-6 for address field composition. If the page size had
been 2K bytes, then address bits 11-15 would be used to obtain the five bits for TLB
set selection. Whatever the page size, five bits are required to select one of 32 TLB
sets. The Am29000 processor actually has 64 TLB entries, arranged as two per TLB set.
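
The field boundaries described above can be sketched in C. This is a minimal illustration, not code from the example system: the type and function names are hypothetical, and the page size is passed as a shift count (10 for 1K up to 13 for 8K) rather than as the raw PS field encoding.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_SET_BITS 5u   /* 32 sets, whatever the page size */

typedef struct {
    uint32_t tag;     /* upper bits, later compared with the TLB entry */
    uint32_t set;     /* TLB set number, 0..31 */
    uint32_t offset;  /* untranslated offset within the page */
} va_fields;

/* Split a 32-bit virtual address for a page of (1 << page_shift) bytes. */
va_fields split_virtual_address(uint32_t va, unsigned page_shift)
{
    va_fields f;

    assert(page_shift >= 10u && page_shift <= 13u);  /* 1K..8K pages */
    f.offset = va & ((1u << page_shift) - 1u);
    f.set    = (va >> page_shift) & ((1u << TLB_SET_BITS) - 1u);
    f.tag    = va >> (page_shift + TLB_SET_BITS);
    return f;
}
```

For a 4K byte page (shift of 12) the set number comes from bits 12 to 16, and for a 2K byte page (shift of 11) from bits 11 to 15, matching the text.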


Figure 6-6. TLB Field Composition for 4K Byte Page Size (bits 31-17: virtual address
tag comparison; bits 16-12: TLB set; bits 11-0: address offset within page)


Each TLB entry contains an address translation for a single page. Therefore the MMU
contains translations for a maximum of 64 pages. It is possible that the address
requiring translation does not match any of the current TLB entries, but this will be
discussed later. The virtual address space is divided into 32 sets of equal sized
pages (known as sets 0 to 31). Page 0, starting at address 0, belongs to set 0. Page 1
belongs to set 1, and so on. Pages 32, 64 and many more also belong to set 0. Likewise
pages 31, 63 and more belong to set 31. All addresses falling on pages which are
members of a set must obtain an address translation from the TLB entries which are
associated with that set. This is known as Set Associative Translation. If a page
address could be translated by any entry in the TLB, then the translation technique
would be known as Fully Associative.

Set associative translation requires less chip area to implement than fully
associative mechanisms, and can more easily operate at higher speeds. However, there
are still many pages which compete with each other to get their address translation
stored in a TLB entry assigned to the associated TLB set. For this reason the Am29000
processor supports two TLB entries per set. This is often expressed as “two columns
per set”. A page associated with a particular set can have its address translation
located in either of the two possible TLB entries. This leads to the title: Two-way
Set Associative Translation.

To determine which TLB entry holds a valid translation for the page currently being
translated, the upper address bits (bits 17-31 in our 4K byte page example) are
compared with the VTAG field in the TLB entry. The VTAG contains the corresponding
upper bits for the TLB entry’s current translation. If a match occurs, and other TLB
permission bit field requirements are also satisfied, then the TLB RPN field supplies
the upper address bits for the now translated physical address. In our 4K page example
the RPN (Real Page Number) field would supply upper address bits 12 to 31, which when
combined with the page offset bits 0 to 11, produce a 32-bit physical address. See
Figure 6-7 for a block diagram of the TLB layout.
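
The tag comparison and physical address formation can be modelled in software as follows. This is a simplified sketch for the 4K byte page case only: the structure and function names are hypothetical, the `valid` flag stands in for the remaining match conditions, and permission checking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12u                 /* 4K byte pages */
#define SET_BITS    5u
#define NUM_SETS    (1u << SET_BITS)    /* 32 sets */
#define NUM_COLUMNS 2u                  /* two-way set associative */

typedef struct {
    bool     valid;  /* simplified stand-in for the other match conditions */
    uint32_t vtag;   /* virtual address bits 31..17 */
    uint32_t rpn;    /* physical address bits 31..12 */
} tlb_entry;

typedef struct {
    tlb_entry entry[NUM_SETS][NUM_COLUMNS];
} tlb_model;

/* Returns true and writes *pa on a hit; returns false on a miss. */
bool tlb_translate(const tlb_model *tlb, uint32_t va, uint32_t *pa)
{
    uint32_t offset = va & ((1u << PAGE_SHIFT) - 1u);
    uint32_t set    = (va >> PAGE_SHIFT) & (NUM_SETS - 1u);
    uint32_t tag    = va >> (PAGE_SHIFT + SET_BITS);
    unsigned col;

    assert(tlb && pa);
    for (col = 0; col < NUM_COLUMNS; ++col) {
        const tlb_entry *e = &tlb->entry[set][col];
        if (e->valid && e->vtag == tag) {
            /* RPN supplies bits 31..12; the offset passes through. */
            *pa = (e->rpn << PAGE_SHIFT) | offset;
            return true;
        }
    }
    return false;
}
```

Only the two entries of the selected set are examined, which is what distinguishes the two-way set associative scheme from a fully associative search of all 64 entries.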

Figure 6-7. Block Diagram of Am29000 Processor TLB Layout (TLB column 0 and TLB
column 1, translating a virtual address to a physical address)

TLB entries are constructed from fields requiring 64 bits of storage. Two TLB
registers are required to describe a TLB entry: the first register holds entry word 0
and the second holds entry word 1. This results in 128 TLB registers supporting the 64
TLB entries (32 sets, two ways per set). Figure 6-8 shows the TLB register layout.

Figure 6-8. Am29000 Processor TLB Register Format

Now that the address translation mechanism has been discussed, the TLB entry fields
can be examined in more detail. The VTAG and RPN fields have already been described.
Word 0 contains access permission fields. First look at the TID field of word 0. For a
TLB entry to match the current translation, not only must the VTAG match the upper
virtual address bits, but the current process identifier (PID) must match the task
identifier in the TID field. The PID is located in an 8-bit field in the MMU
configuration register.

Multi-tasking operating systems assign a unique PID to each task. Whenever a context
switch occurs to a new task, the MMU register is updated with the PID for the
currently executing task. This enables the MMU to support multi-tasking without having
to flush the TLB registers at every context switch. TLB entries are likely to remain
until a task is again restored and the TLB entries reused. TLB entries are only valid
if the VE bit is set; the VE bit for each TLB entry should be cleared before address
translation is enabled.

When  the  processor  is  running  in  Supervisor  Mode  (the  SM  bit  in  the  CPS  regis¬ 
ter  is  set),  then  the  current  PID  value  is  zero,  regardless  of  the  PID  value  located  in  the 
MMU  register.  Each  TLB  entry  can  separately  enable  read,  write  and  execute  per¬ 
missions  for  accesses  to  the  mapped  page.  The  SE,  SR  and  SW  bits  control  access 
permissions  for  Supervisor  accesses  to  the  page.  The  UR,  UW  and  UE  bits  control 
access  permissions  for  the  TID  identified  user. 

If no currently valid mapping can be found in the two associated TLB entries, then a
TLB miss trap occurs. There are four traps assigned to support address translation
misses: two are reserved for the processor operating in Supervisor mode, and an
additional two can be taken when a translation is not found while the processor is
operating in User mode. Each mode has separate traps for instruction address
translation and data address translation. A subsequent section describes the process
of taking a trap.

Two additional traps are assigned to Supervisor and User mode protection violations.
These occur when a TLB entry holds a valid translation but the permission fields do
not allow the type of access being attempted. For example, unless the UW bit is set, a
User mode process cannot write to the mapped page, even if all other TLB entry fields
indicate a match with the translation address.
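
The match and permission rules described above can be summarised in a small software model. This is only a sketch: the field names follow the text, but the structure layout, the function names and the three-way result are illustrative assumptions. The real hardware distinguishes separate miss traps for User/Supervisor and instruction/data accesses, which this model collapses into a single "miss" result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { ACC_READ, ACC_WRITE, ACC_EXECUTE } access_kind;
typedef enum { XLATE_OK, XLATE_MISS, XLATE_PROTECTION } xlate_result;

/* Decoded view of TLB entry word 0 (not the packed register layout). */
typedef struct {
    bool     ve;          /* entry valid */
    uint32_t vtag;        /* virtual address tag */
    uint8_t  tid;         /* task identifier */
    bool     sr, sw, se;  /* Supervisor read/write/execute */
    bool     ur, uw, ue;  /* User read/write/execute */
} tlb_word0;

/* In Supervisor mode the current PID is treated as zero. */
bool entry_matches(const tlb_word0 *e, uint32_t tag,
                   uint8_t mmu_pid, bool supervisor)
{
    uint8_t pid = supervisor ? 0u : mmu_pid;
    return e->ve && e->vtag == tag && e->tid == pid;
}

bool access_permitted(const tlb_word0 *e, access_kind kind, bool supervisor)
{
    switch (kind) {
    case ACC_READ:    return supervisor ? e->sr : e->ur;
    case ACC_WRITE:   return supervisor ? e->sw : e->uw;
    case ACC_EXECUTE: return supervisor ? e->se : e->ue;
    }
    return false;
}

/* Check both columns of one TLB set for the given access. */
xlate_result check_set(const tlb_word0 set[2], uint32_t tag,
                       uint8_t mmu_pid, bool supervisor, access_kind kind)
{
    int col;

    assert(set);
    for (col = 0; col < 2; ++col) {
        if (entry_matches(&set[col], tag, mmu_pid, supervisor))
            return access_permitted(&set[col], kind, supervisor)
                       ? XLATE_OK : XLATE_PROTECTION;
    }
    return XLATE_MISS;
}
```

A User mode write to a page whose entry matches but has UW clear yields the protection result rather than a miss, mirroring the example in the text.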

Now examine the bit fields of word 1. The IO bit is little used; it enables a virtual
address to be associated with a physical page in I/O space. The U bit is maintained by
the Am29000 processor. Whenever a TLB set is used in a valid translation, the U bit
associated with the set is updated to indicate which of the two TLB entries was used.
In other words, the U bit selects the column within the set. Special register 14, the
least-recently used (LRU) register, has a 6-bit field which is updated whenever an
address translation fails and a TLB access trap occurs. The U bit supplies the most
significant bit of this field, and the lower 5 bits are loaded with the TLB set
number. Thus the LRU register supplies to the trap handler a recommendation for TLB
entry replacement. The trap handler typically builds a new valid TLB entry at the
recommended location before execution of the interrupted program is continued.
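
Decoding the 6-bit recommendation is straightforward, as the sketch below shows. The names are hypothetical. The register numbering used in `tlb_register_number` (column 0 entries in registers 0 to 63, column 1 entries in registers 64 to 127, two consecutive registers per entry) is an assumption made for illustration and should be checked against the processor documentation.

```c
#include <assert.h>

typedef struct {
    unsigned column;  /* 0 or 1, from the most significant bit */
    unsigned set;     /* 0..31, from the lower five bits */
} tlb_slot;

/* Split the 6-bit LRU recommendation into column and set. */
tlb_slot decode_lru_field(unsigned lru_field)
{
    tlb_slot s;

    assert(lru_field < 64u);
    s.column = (lru_field >> 5) & 1u;
    s.set    = lru_field & 0x1Fu;
    return s;
}

/* ASSUMED numbering: 64 registers per column, word 0 then word 1. */
unsigned tlb_register_number(tlb_slot s, unsigned word)
{
    assert(s.column < 2u && s.set < 32u && word < 2u);
    return s.column * 64u + s.set * 2u + word;
}
```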

The 2-bit PGM field is not assigned a task by the Am29000 processor; these bits are
placed on the PGM[1:0] output pins when a translation occurs. Developers can place any
information they wish in the PGM bits. These bits are particularly useful for
multiprocessor applications when one processor wishes to signal other processors about
page cacheability information.

All data accesses have their translated address and corresponding PGM value presented
on the chip pins in the cycle following the cycle executing the LOAD or STORE
instruction. Pages containing instructions have their corresponding PGM bits presented
on the chip pins when a jump or call to an address within the page first occurs.
However, if the target of the jump or call is found in the on-chip instruction cache
and the address bus is in use while the jump or call instruction is in execute, the
PGM bits for the target instruction page will not be presented on the chip PGM[1:0]
pins.

In  this  chapter,  the  software  controlled  cache  code  example  shall  use  the  PGM 
bits  to  store  page-lock  and  page-dirty  information  in  bits  PGM[0]  and  PGM[1],  re¬ 
spectively. 

6.2.1  Dual  TLB  Processors 

Newer microprocessor and microcontroller members of the 29K family do not have the
full complement of 64 address translations cached in their TLB. A smaller TLB size of
16 entries enables valuable silicon space to be used for on-chip functions such as
peripherals. To support the smaller number of TLB entries, the maximum page size has
been increased from 8K bytes to 16M bytes. This enables a large amount of virtual
memory to be mapped with the reduced number of translation entries. Note that the Page
Size (PS) field in the MMU configuration register is increased from 2 bits to 3 bits
to support the wider page-size selection.

A consequence of the smaller number of TLB sets (8 for 16 two-way entries) is a larger
VTAG field. The Am29000 processor uses 5 bits to select from its 32 sets (64 entries).
The Am29240 only requires 3 bits to select the correct set. The loss of 2 bits for set
selection causes a corresponding increase in the VTAG field. With a minimum page size
of 1K bytes (10 address bits), a maximum VTAG field of 19 bits is required. To enable
the VTAG field to fit within TLB Entry Word 0, two permission bits are omitted: the
Supervisor Read (SR) and Supervisor Execute (SE) access protection is not available
with processors supporting larger page sizes. Consequently, Supervisor mode programs
can always read and execute code/data from pages which have a currently valid mapping.
The TLB register format is shown in Figure 6-9.
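
The VTAG widths quoted above follow from subtracting the page offset bits and the set selection bits from the 32-bit address. A minimal sketch of that arithmetic, with a hypothetical function name:

```c
#include <assert.h>

/* Tag width for a 32-bit address, given the page size as a shift count
 * and the number of TLB sets (which must be a power of two). */
unsigned vtag_bits(unsigned page_shift, unsigned num_sets)
{
    unsigned set_bits = 0u;

    assert(num_sets > 0u && (num_sets & (num_sets - 1u)) == 0u);
    while ((1u << set_bits) < num_sets)
        ++set_bits;
    assert(page_shift + set_bits <= 32u);
    return 32u - page_shift - set_bits;
}
```

With 1K byte pages this gives 17 bits for 32 sets and 19 bits for 8 sets. With 4K byte pages and 32 sets it gives 15 bits, that is, address bits 17 to 31.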
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Figure  6-9.  TLB  Register  Format  for  Processor  with  Two  TLBs 


The Am29243 microcontroller supports two TLBs. This enables valid translations for a
larger virtual address space to be maintained at any time. Each TLB operates
independently, and the two can be programmed with different page sizes. The MMU
configuration register has two Page Size (PS) fields, one for each TLB. Dividing the
TLB register space (128 registers) into two TLBs enables up to 32 translations to be
held in each TLB. Each Am29243 TLB implements 16 of the possible 32 translations. The
Least Recently Used (LRU) register has two LRU-recommendation fields, one for each
TLB. The fields are arranged such that future processors can implement the complete
complement of 16 sets (32 translations) per TLB. When a TLB miss occurs both LRU
fields are updated. Support software must decide which LRU field to use and
consequently which TLB to update. If the TLBs are allocated to different address
regions, the miss address can be used to select the appropriate field.

TLB Entry Word 1 has an additional field compared with the Am29000 register format:
the Global Page (GLB) bit. When set, the mapped page can be accessed by any process
regardless of its process identifier (PID). This can be very useful when dealing with
regions of shared code or data. Multiple processes can access, say, a shared library
without each process having to have valid translation entries for the memory pages
containing the shared information.

The Am29040 2-bus processor also supports two TLBs. The TLB register format is the
same as used with the Am29240 microcontroller. However, there are a number of
additional fields implemented in Entry Word 1. The width of the data bus used for
external memory accesses can be reduced to 16 bits if the D16 bit is set. When set, a
32-bit data object is accessed via two 16-bit accesses. The D16 bit simplifies access
to memory or other devices which must be accessed with a 16-bit width format. The PCE
bit enables parity checking for the mapped page. Parity is odd or even depending on
the POE bit in the Configuration Register (CFG).


Table 6-1. PGM Field of the Am29040 Microprocessor TLB

PGM1    PGM0    Effect
0       0       normal (copy-back)
0       1       write-through
1       0       reserved
1       1       non-cacheable

With virtual addressing in use, the Am29040 TLB entries enable a data cache
maintenance policy to be selected on a per-page basis (see Table 6-1). The default
copy-back policy generally achieves the highest performance. When the MMU is not in
use (physical addressing) a copy-back policy is applied for cached data. See section
5.14.2 for more details about Am29040 data cache policy. Note that when the D16 bit is
set, the access is considered non-cacheable.

The  example  code  presented  in  this  chapter  for  a  software  controlled  cache  is 
based  on  the  Am29000-type  TLB  register  format.  To  make  the  code  work  with  an 
Am2924x  or  Am29040  processor  would  require  some  small  changes.  The  code  se¬ 
quences  requiring  modification  would  be  in  the  construction  of  TLB  entry  Word  0 
and  Word  1.  This  does  not  detract  from  the  value  of  the  example  code. 

6.2.2  Taking  a  TLB  Trap 

The address translation performed by the MMU is determined by the trap handler
routines which are used to update the TLB registers. When the current processor status
register bits PD and PI are both clear, enabling the MMU hardware for both data and
instruction address translation, the DA and FZ bits in the CPS register must also be
cleared. Clearing these bits disables Am29000 special register freezing and enables
traps to be taken.

When  the  MMU  does  not  contain  a  match  for  the  current  address  translation,  a 
trap  is  taken  by  the  processor.  This  also  happens  for  valid  translations  not  meeting 
permission  requirements.  The  software  executed  by  the  trap  handler  must  construct  a 
TLB  entry  for  the  failing  address  from  page  table  entries  (PTEs)  stored  in  memory. 
The  TLB  registers  simply  act  as  a  cache  for  the  currently-needed  translations  stored 
in  off-chip  data  memory. 

Many CISC-type processors have algorithms in the chip microcode for automatically
updating the MMU hardware from more extensive data located in external data memory.
Because the Am29000 does not implement this function in hardware, the user is free to
construct a software algorithm for TLB reloading which best suits the memory
management architecture. This increased flexibility outweighs any increase in TLB
register reload time that may occur for some configurations. The flexibility is what
makes possible the software controlled cache described later.

When the Am29000 takes a trap the processor enters Supervisor mode with critical
support registers frozen. This is known as Freeze mode. A more complete explanation is
given in Chapter 4 (Interrupts and Traps). The frozen special registers describe the
state of the processor at the time of the address translation failure. Examining these
registers enables the trap handler software to determine the necessary action and
eventually restart the instruction that was in execute when the trap occurred. After
the trapware routines have constructed the required TLB entry, the faulting
instruction will be able to complete execution.

Later  sections  will  deal  with  the  trapware  in  detail  for  the  example  software  con¬ 
trolled  cache  system.  The  interesting  details  of  the  trapware  will  be  covered  then. 
Since  the  code  is  memory  architecture  specific,  the  operation  of  the  software  con¬ 
trolled  cache  needs  to  be  discussed  first.  This  discussion  is  in  the  later  section  entitled 
Software  Controlled  Cache  Memory  Architecture  (section  6.4). 

6.3  PERFORMANCE  EQUATION 

Performance has been considered in terms of the average number of cycles per
instruction execution. This is a useful metric when considering memory system
architectures. Figure 6-1, Figure 6-2 and Figure 6-4 give average cycles per user
instruction execution (Ac/i). However, if a TLB miss occurs during instruction
execution, a number of Supervisor mode trapware instructions will be required to
prepare the TLB registers before the user’s code can continue. If TLB trapware is
activated in support of too many instructions, then the effective number of cycles
required per application instruction will increase.

The effective average cycles per instruction is given by:

$$A_{effective} = P \, A_{c/i}$$

where Ac/i is the average number of cycles per instruction for the program running in
physical mode, without the MMU in operation. The multiplying factor, P, determines how
much performance is reduced by the use of the MMU hardware. The value of P is given
by:

$$P = 1 + \frac{P_{tlb/i} \, P_{miss} \, T_{cycles}}{A_{c/i}}$$

We shall look at the terms of this equation individually to determine their effect.
Term Ptlb/i is the probability that an instruction causes a TLB access. Figure 6-5
showed average figures for Ptlb/i observed with the four benchmark programs examined.
Given that a TLB access occurs, we are then interested in the probability that an
entry is not found and a miss trap is taken. This conditional probability is given by
term Pmiss, and Figure 6-10 shows average Pmiss values for the four benchmark programs
running on the software controlled cache system.


Figure 6-10. TLB Miss Ratio for Joint I/D 2-1 SRAM System (plotted against page sizes
of 1K, 2K, 4K and 8K bytes, directly mapped, for the compress, assembler, nroff and
diff programs)

What matters at present is that TLB miss rates increase as the page size is decreased.
This is expected because smaller page sizes mean a smaller portion of the program’s
pages have mappings currently cached in the TLB registers. Given that the Am29000
processor has a fixed number of TLB entries, it is best to have large page sizes if
TLB misses are to be reduced. However, the finer granularity of small page sizes may
lead to better physical memory utilization. An additional consideration is the size of
pages transported from secondary memory such as disk or network connections. Transfers
to and from secondary memory may be made more efficient by using larger page sizes.

Figure 6-11. Average Cycles Required per TLB Miss (nroff and assembler programs; 4-1
DRAM with 128-page 2-1 cache)

The final term of the equation, Tcycles, is the average number of cycles required to
process a TLB miss. Figure 6-11 shows values for the benchmark programs running on the
cache system. When a TLB miss occurs for a page which is not currently located in
physical memory but in secondary memory, a large number of processor cycles is
required to first transfer the page from secondary memory to physical memory and then
build a valid TLB entry. As the page size increases, the TLB miss trap handler
execution time increases substantially.

The product Ptlb/i Pmiss Tcycles gives the average number of overhead cycles added to
each application instruction in order to support the MMU operation. After studying the
software cache memory architecture, the effective number of cycles per instruction
achieved will be reexamined and compared with the non-cache memory architecture
performance.
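
The performance equation can be evaluated with a few lines of C. The sketch below uses hypothetical function names, and any numbers passed to it are the caller's own estimates rather than measurements taken from the figures.

```c
#include <assert.h>

/* Average overhead cycles added to each instruction by TLB misses:
 * Ptlb/i * Pmiss * Tcycles. */
double mmu_overhead_cycles(double p_tlb_per_instr, double p_miss,
                           double t_cycles)
{
    assert(p_tlb_per_instr >= 0.0 && p_tlb_per_instr <= 1.0);
    assert(p_miss >= 0.0 && p_miss <= 1.0);
    assert(t_cycles >= 0.0);
    return p_tlb_per_instr * p_miss * t_cycles;
}

/* Multiplying factor P = 1 + overhead / Ac/i. */
double mmu_penalty_factor(double a_ci, double p_tlb_per_instr,
                          double p_miss, double t_cycles)
{
    assert(a_ci > 0.0);
    return 1.0 + mmu_overhead_cycles(p_tlb_per_instr, p_miss, t_cycles) / a_ci;
}

/* Effective cycles per instruction: P * Ac/i. */
double effective_cpi(double a_ci, double p_tlb_per_instr,
                     double p_miss, double t_cycles)
{
    return mmu_penalty_factor(a_ci, p_tlb_per_instr, p_miss, t_cycles) * a_ci;
}
```

As a purely illustrative case, taking Ac/i = 2.0, Ptlb/i = 0.30, Pmiss = 0.01 and Tcycles = 100 gives an overhead of 0.3 cycles per instruction, so P is 1.15 and the effective figure is 2.3 cycles per instruction. Only the 30% access probability is taken from the text; the other three values are made up for the example.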

6.4  SOFTWARE  CONTROLLED  CACHE  MEMORY  ARCHITECTURE 

By studying a software controlled cache mechanism we can achieve three objectives:
first, a better understanding of the non-TLB-cached page-table layout; second, further
understanding of TLB trapware implementation detail; third, an awareness of software
controlled cache benefits.

When a TLB miss occurs, the trap handler must determine the replacement TLB entry
data. It does this by indexing a table of Page Table Entries (PTEs). Each PTE contains
information on how to map a physical page to its corresponding secondary memory page.
In our example system, the physical memory is SRAM and the secondary memory is DRAM.
In fact, the secondary memory is physically addressable, but the execution of all
programs from within the limited-size SRAM cache will be attempted, and the DRAM will
only be accessed when a page needs copying to or from secondary memory.

There are many different PTE table arrangements. Some systems have multiple layers of
PTEs, where a higher level PTE points to tables of lower level PTEs. In multi-tasking
systems, each task may have its own table of PTEs. If the Supervisor code also
executes with address translation, then it too may have a table of PTEs. To simplify
our example system, we will assume the Supervisor always runs in physical mode, and
there is a single table of PTEs shared by all User mode programs. To evaluate the
system performance, only single User mode tasks will be run, in particular the nroff
and assembler utility programs.

PTEs  need  not  have  the  same  structure  as  TLB  entries.  They  typically  do  not. 
This  enables  the  memory  management  system  to  keep  additional  page  information  in 
memory  and  only  cache  critical  data  in  the  TLB  registers.  In  addition  it  may  be  pos¬ 
sible  to  compact  information  into  a  smaller  PTE  structure,  which  results  in  a  substan¬ 
tial  space  saving  in  systems  which  keep  extensive  PTE  tables  permanently  in  physi¬ 
cal  memory  (in  our  case  SRAM).  For  the  example  system,  PTEs  shall  have  exactly 
the  same  format  as  TLB  entries.  The  method  has  the  benefit  that  TLB  entries  can  be 
loaded  from  PTE  memory  location  directly  without  additional  processor  cycles  be¬ 
ing  expended  in  reformatting. 

The PTE format will be 4-way set associative. The number of sets shall be limited by
the amount of available SRAM cache memory, but a lower limit of 32, established by the
Am29000, is required. Given a minimum page size of 1K bytes, the SRAM cannot be smaller
than 128K bytes (1K x 4 x 32). If the number of PTE sets is greater than 32, then the
cache has more set resolution than the TLBs. In this case a TLB set caches entries for
more than one PTE set, and the TLB VTAG field has more address resolution than the PTE
VTAG field requires.
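
The sizing rules above can be checked with a short calculation. In the sketch below the names are hypothetical. The 8-byte PTE size follows from the PTEs sharing the 64-bit TLB entry format, while selecting a PTE set from the low-order bits of the virtual page number is an assumption made by analogy with TLB set selection.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_BYTES 8u    /* two 32-bit words, same format as a TLB entry */
#define PTE_WAYS  4u    /* 4-way set associative PTE table */
#define TLB_SETS  32u   /* lower limit on the number of PTE sets */

/* Smallest SRAM cache that holds one page per PTE. */
uint32_t min_cache_bytes(uint32_t page_bytes, uint32_t pte_sets)
{
    assert(pte_sets >= TLB_SETS);
    return page_bytes * PTE_WAYS * pte_sets;
}

/* Memory occupied by the PTE table itself. */
uint32_t pte_table_bytes(uint32_t pte_sets)
{
    return pte_sets * PTE_WAYS * PTE_BYTES;
}

/* ASSUMPTION: PTE set chosen by low-order virtual page number bits. */
uint32_t pte_set_index(uint32_t va, unsigned page_shift, uint32_t pte_sets)
{
    assert(pte_sets >= TLB_SETS && (pte_sets & (pte_sets - 1u)) == 0u);
    return (va >> page_shift) & (pte_sets - 1u);
}

/* TLB set that caches entries belonging to a given PTE set. */
uint32_t tlb_set_for_pte_set(uint32_t pte_set)
{
    return pte_set & (TLB_SETS - 1u);
}
```

With 32 sets and 1K byte pages the minimum cache is 128K bytes and the PTE table occupies 1K byte. With 64 PTE sets, PTE sets 1 and 33 both fall to TLB set 1, which illustrates one TLB set caching entries for more than one PTE set.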

Each TLB entry indicates how the user’s virtual address is mapped into an SRAM page
number (given by the TLB RPN entry). The PTE entries must have a mapping relationship
with both DRAM memory pages and SRAM memory pages. This enables the memory page
maintained by the PTE to be moved between SRAM and DRAM. The entries use the PTE RPN
field to store the DRAM page number. The PTE to SRAM mapping is simple. PTEs are
stored consecutively in memory, as are SRAM pages. Given the PTE address, the
corresponding SRAM page address can be found by determining the PTE address
displacement from the PTE table base. The PTE displacement (counted in PTEs),
multiplied by the page size, will locate the SRAM page relative to the base address of
SRAM pages. Figure 6-12 outlines the system.
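
The PTE to SRAM page calculation can be sketched as follows. The function and parameter names are hypothetical, and the 8-byte PTE size assumes the PTEs use the 64-bit TLB entry format adopted for the example system.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_BYTES 8u   /* two 32-bit words per PTE */

/* Address of the SRAM page managed by the PTE at pte_addr. */
uint32_t sram_page_address(uint32_t pte_addr, uint32_t pte_table_base,
                           uint32_t sram_pages_base, unsigned page_shift)
{
    uint32_t displacement;

    assert(pte_addr >= pte_table_base);
    displacement = pte_addr - pte_table_base;
    assert(displacement % PTE_BYTES == 0u);   /* must point at a PTE */

    /* PTE index multiplied by the page size, relative to the page base. */
    return sram_pages_base + ((displacement / PTE_BYTES) << page_shift);
}
```

For example, with 1K byte pages the fourth PTE in the table (displacement of 24 bytes) corresponds to the SRAM page starting 3K bytes above the base of the SRAM pages.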
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Figure  6-12.  PTE  Mapping  to  Cache  Real  Page  Numbers 


Because the TLB entries are not an exact cache of the PTE entries, due to the RPN
field differences, word 1 must be adjusted accordingly before the TLB register can be
updated from the PTE entry.

The Am29000 C language calling convention reserves processor registers gr64-gr95 for
operating system use. To improve trap handler performance a number of these registers
are used by these critical routines. Six registers are required for temporary use, and
two registers are used for caching static information. The particular registers used
are described later along with the example code. The two static registers are of
particular interest; they are given the synonyms upte and kmsp.

It is desirable to keep critical data and routines in SRAM memory. For example, the
TLB miss handler routines should be stored in cache memory. Cached pages can be marked
as locked-in; this will prevent them from being paged-out to DRAM. However, the SRAM
is only intended to hold User mode application pages. Trap handlers and other critical
operating system routines run in Supervisor mode and, in our example system, without
address translation. In practice, a larger SRAM could be implemented and, say, half
allocated for cache use, with the other half being reserved for operating system code
and data. This may not lead to the most efficient use of such an effective resource as
SRAM. The problem can be overcome by marking certain PTE entries as invalid but
locked. The SRAM pages corresponding to these PTEs can then be accessed in
non-translated address mode by Supervisor mode code.

Since the PTE table is frequently accessed by TLB trapware, it is important that quick
access to the table is supported. For this reason register upte is initialized to
point to the base of the PTE table, and the table is located in the first SRAM page.
One SRAM page can contain 32 sets of PTE data. In multi-tasking systems, with each
task having its own PTE table, the upte value is normally stored in a per-task data
structure known as the Process Control Block (PCB), and the upte register is updated
from the PCB data at each context switch.

The Am29000 takes traps very quickly, without expending a number of internal
processor cycles preparing an interrupt processing context for the processor. This
advantage over typical CISC processor operation enables the Am29000 to process the
trap quickly in Freeze mode and return to the user’s program. It is the Freeze mode
processing capability of the Am29000 that makes a soft cache mechanism attractive.
However, TLB miss handlers cannot always complete their handling quickly in
Freeze mode code. In such cases they must signal the operating system to continue
with further processing: Freeze mode is departed, and Supervisor mode with freeze
disabled is entered. Before Freeze mode can be exited, the frozen special registers
must be stored on a Supervisor mode memory stack. They will have to be restored
from this stack once the operating system completes the TLB miss processing. The
operating system stack is located on page 4, which is in a different set from the PTE
table. Operating system accessible register kmsp is used as a stack pointer.

Using the cache architecture described, the nroff and assembler utilities were
observed running in a 128 page SRAM based system. The page-in activity is shown
in Figure 6-13. It appears the two programs were too large to execute in 128K byte
SRAM (1K byte page size). The paging activity is at a minimum with a 256K byte
cache (2K byte page size). It is possible the increased paging activity is due to cache
sets being only 4-way. In the case of nroff, it is more likely the page replacement
algorithm was having difficulty in keeping the desired pages in the cache for such a
large program.

As page sizes get larger, the probability of a TLB miss diminishes. Since the
cache gets larger for an SRAM with a fixed number of pages, we can expect the
probability of a page-in, given that a TLB miss has occurred, to increase as page
size increases. This reflects the fact that, with large caches, a TLB miss causes a
page-in and the TLB then maintains a cached entry for the permanently resident
page. Figure 6-14 gives the probability of a page-in given a TLB miss has occurred.

With the nroff utility, the probability actually reduces when the page size is
increased from 1K byte to 2K byte. This is because of the cache-thrashing occurring
with the 128K byte cache used with the 1K byte page size.

Figure 6-13. Software Controlled Cache, K bytes paged-in

Figure 6-14. Probability of a Page-in Given a TLB Miss

6.4.1 Cache Page Maintenance

The example software controlled cache system only supports User mode address
translation. This means Supervisor mode TLB miss handlers will not be considered.
TLB entries shall always enable instruction execution for each page; this eliminates
support for the TLB instruction access protection violation trap. Pages will be
initially marked as non-writeable; as will be seen, this supports maintenance of the
page-dirty bit. So in total, we need only deal with three traps: instruction access
miss, data access miss, and data access protection violation.

The Am29000 has 65 global registers (gr1, gr64-gr127); of these, 32 are reserved
for operating system use only (gr64-gr95). To improve the performance of the
trapware, several of the operating system registers have been assigned TLB handler
support functions. The following code uses register synonyms, so the actual register
assignments can be easily changed.


        .reg    it0, gr64               ; Freeze mode
        .reg    it1, gr65               ; temporary regs
        .reg    it2, gr66
        .reg    it3, gr67
        .reg    kt0, gr68               ; temporary regs
        .reg    kt1, gr69
        .reg    kmsp, gr93              ; supervisor M-stack
        .reg    upte, gr95

The code shown within this chapter makes use of a number of macros for pushing
and popping special registers to an external memory stack. These macros, push,
pushsr, pop and popsr, were described in section 3.3.1 (Useful Macro-Instructions).

The example code can be used to construct a cache with various numbers of PTE
entries (ways or columns) per set, and various total numbers of sets. The constant
definitions shown below are used to control the cache size.

        .equ    PGSIZE, 10              ; Page size
        .equ    C_SETS, 6               ; cache sets
        .equ    C_COLUMNS, 2            ; columns per set
        .equ    WSIZE, 512              ; window size
        .equ    SIG_ICMISS, 1           ; signal I-miss
        .equ    SIG_DCMISS, 3           ; signal D-miss
        .equ    SIG_PROTECT, 5          ; signal W-protect
        .equ    CTX_CHC, 3*4            ; context offset

        .sect   cache, bss
        .use    cache
cache_adds:
        .block  (1<<PGSIZE)*(1<<C_SETS)*(1<<C_COLUMNS)
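
As the .block expression suggests, PGSIZE, C_SETS and C_COLUMNS are used as
powers of two. The following Python sketch, offered only as an aid to reading
the constants, derives the cache geometry they imply. The function name and
the dictionary keys are hypothetical, and the 8-byte PTE size is taken from the
two-word PTE layout used by the handlers.

```python
def cache_geometry(pgsize=10, c_sets=6, c_columns=2, pte_bytes=8):
    """Return the sizes implied by the log2 constants of the example code."""
    page_bytes = 1 << pgsize
    sets = 1 << c_sets
    columns = 1 << c_columns
    pages = sets * columns                      # one SRAM page per PTE
    return {
        "page_bytes": page_bytes,
        "sets": sets,
        "columns": columns,
        "pages": pages,
        "cache_bytes": pages * page_bytes,      # size reserved by .block
        "pte_table_bytes": pages * pte_bytes,   # space needed for the PTEs
    }


if __name__ == "__main__":
    for name, value in cache_geometry().items():
        print(name, value)
```

With the values shown, the block reserved at cache_adds is 256K bytes, arranged
as 64 sets of 4 columns with 1K byte pages.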

The operating system code, which is not shown, is responsible for initializing the
support registers kmsp and upte. It must also mark the PTEs locked and invalid for
any SRAM pages which are not to be used for caching, but by the operating system.
The example code uses pages 0 and 4 to store performance critical support data.

6.4.2  Data  Access  TLB  Miss 

When a read or write data access occurs for a page whose translation from virtual
to physical address is currently not in the TLB registers, a TLB miss is taken. This
causes execution to vector to trap number 9. The address of the trapware handler,
UDTLBmiss, is at location 9 in the vector table. A miss may occur because the
accessed page is currently not in the cache, or, more importantly, because the PTE
mapping the cached page is currently not cached by the TLB registers. The PTEs for
the appropriate set must be scanned to determine if the page is in the cache.

When a trap is taken, the Am29000 processor special support registers are frozen;
their contents report the state of the processor at the time of the trap. Special
register CHA contains the virtual address for the failing data access. Using the CHA
value, the cache set is determined and the 4 PTE columns assigned to the set are
scanned. The PTE valid bit must be set and the PTE VTAG field must match with the
upper bits of the CHA address for a match to be found. Note, the example code does
not compare the TID field; this would be necessary if the cache were supporting a
multi-tasking operating system.


UDTLBmiss:
        mfsr    it0, cha
        const   kt1, SIG_DCMISS         ; signal number
        srl     it2, it0, PGSIZE        ; select cache set
        and     it2, it2, (1<<C_SETS)-1
        sll     it2, it2, 3+C_COLUMNS
        add     it2, it2, upte          ; adds of 1st PTE

scan_columns:
        srl     it0, it0, PGSIZE+5
        sll     it0, it0, PGSIZE+5
        const   kt0, (1<<C_COLUMNS)-1
next_column:
        jmpt    kt0, not_cached
        sub     kt0, kt0, 1             ; dec column count
        load    0, 0, it1, it2          ; load word 0
        add     it2, it2, 8             ; next PTE entry
        sll     it3, it1, 31-14         ; test VE-bit
        jmpf    it3, next_column
        srl     it3, it1, PGSIZE+5      ; mask PTE VTAG
        sll     it3, it3, PGSIZE+5
        cpeq    it3, it0, it3           ; compare VTAG
        jmpf    it3, next_column
        mfsr    it3, LRU
        sub     it2, it2, 4             ; adds word 1
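
The column scan performed by the trapware can be modeled in a few lines of
Python. This is a sketch under stated assumptions, not a replacement for the
Freeze mode code: memory is represented by a dictionary mapping word addresses
to 32-bit values, the function names are hypothetical, and, like the listing,
the model ignores the TID field.

```python
PGSIZE, C_SETS, C_COLUMNS = 10, 6, 2   # log2 values from the example constants
VE_BIT = 1 << 14                       # valid bit tested in PTE word 0


def vtag(word, pgsize=PGSIZE):
    """Clear the low bits, as the srl/sll pair by PGSIZE+5 does."""
    return (word >> (pgsize + 5)) << (pgsize + 5)


def first_pte_address(addr, upte):
    """Address of column 0 of the cache set selected by addr."""
    cache_set = (addr >> PGSIZE) & ((1 << C_SETS) - 1)
    return upte + (cache_set << (3 + C_COLUMNS))


def find_cached_pte(memory, addr, upte):
    """Return the address of the matching PTE word 0, or None (not cached)."""
    pte_addr = first_pte_address(addr, upte)
    for _ in range(1 << C_COLUMNS):
        word0 = memory.get(pte_addr, 0)
        if word0 & VE_BIT and vtag(word0) == vtag(addr):
            return pte_addr
        pte_addr += 8                  # next PTE entry in the set
    return None
```

A return value of None corresponds to the jump to not_cached, where the
operating system is signaled.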

If a PTE is found in the set which matches with the CHA address, then the TLB
entry of the associated set, selected by the LRU register, is updated with the contents
of the matching PTE. Field RPN of word 1 of the TLB entry is not filled with the
secondary memory (DRAM) page number taken from the PTE, but with the page
number of the SRAM cache page.

in_cache:
        ;Word 0 in it1, it2 points to PTE word 1
        load    0, 0, it0, it2          ; load word 1
        mttlb   it3, it1                ; assign Word 0
        add     it3, it3, 1
        and     it0, it0, 0xc1          ; mask out RPN
        sub     it1, it2, upte          ; set offset
        srl     it1, it1, 3             ; set index
        sll     it1, it1, PGSIZE        ; cache page offset
        add     it1, it1, upte          ; cache RPN
        or      it0, it1, it0           ; or in cache RPN
        mttlb   it3, it0                ; assign Word 1
        iret

When the required page is found in the cache, the TLB handler executes very
quickly without ever leaving Freeze mode. After the TLB entry has been updated, an
IRET instruction causes execution to be restarted from the state defined by the frozen
special registers. The trapware is arranged so the most frequently occurring events
are processed first and suffer the lowest support overhead. However, if the page is not
found in the cache (no matching PTE) then the trapware must call on the operating
system to complete the necessary processing. It does this by sending a signal. The
code following label not_cached pushes the contents of the special registers as well
as other signal information onto a signal frame on the Supervisor memory stack.
Execution is then forced to continue in Supervisor mode with non-translated
addressing at tlb_sig_handler. The signal frame shall be used to repair the special
registers after the higher level operating system support code has completed.

not_cached:
        ; Send a signal to the operating system
        push    kmsp, kt1               ; push signal number
        push    kmsp, gr1               ; push gr1
        push    kmsp, rab               ; push rab
        const   it0, WSIZE
        sub     gr1, rfb, it0           ; set gr1=rfb-WSIZE
        sub     rab, rfb, it0           ; set rab=rfb-WSIZE
        pushsr  kmsp, it0, pc0          ; push pc0
        pushsr  kmsp, it0, pc1
        pushsr  kmsp, it0, pc2
        pushsr  kmsp, it0, cha
        pushsr  kmsp, it0, chd
        pushsr  kmsp, it0, chc
        pushsr  kmsp, it0, alu
        pushsr  kmsp, it0, ops          ; push ops
        ;
        push    kmsp, tav               ; push tav
        cpeq    tav, kt1, SIG_ICMISS
        jmpt    tav, i_miss
        mfsr    tav, pc1                ; pass address
        mfsr    tav, cha
        ;
i_miss:
        mtsrim  chc, 0                  ; cancel load/store
        mtsrim  ops, 0x70               ; set PD|PI|SM
        ;
        const   it1, tlb_sig_handler
        consth  it1, tlb_sig_handler
        add     it0, it1, 4             ; trampoline signal
        mtsr    pc1, it1                ; handler
        mtsr    pc0, it0
        iret

The signal frame has a signal number field which is used to report the type of
TLB trap which occurred. The layout of the frame is given in Figure 6-15. Global
register tav (gr121) is used to pass the address causing the trap to occur. For a TLB
data miss, the address is already contained in the CHA register, but copying it to tav is
convenient because the signal handler code is also shared by other routines.


[Figure: Supervisor memory stack, with higher addresses at the top of the figure,
showing register kmsp and the TLB signal frame. From the top, the frame holds:
signal number, gr1, rab, PC0, PC1, PC2, CHA, CHD, CHC, ALU, OPS, tav.]

Figure 6-15. TLB Signal Frame
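
The frame layout can be cross-checked against the CTX_CHC constant with a short
Python sketch. It assumes that the push and pushsr macros decrement kmsp by one
word before storing, so that kmsp ends up pointing at the last item pushed;
this matches CTX_CHC being 3*4, but the macro definitions themselves are given
in section 3.3.1, not here. The names below are hypothetical.

```python
# Items in the order the not_cached code pushes them onto the stack.
PUSH_ORDER = ["signal number", "gr1", "rab", "PC0", "PC1", "PC2",
              "CHA", "CHD", "CHC", "ALU", "OPS", "tav"]


def frame_offsets(push_order=PUSH_ORDER, word_bytes=4):
    """Byte offset of each saved item from kmsp once the frame is complete.

    Assumes a downward-growing stack: the last item pushed is at offset 0.
    """
    last = len(push_order) - 1
    return {name: (last - i) * word_bytes for i, name in enumerate(push_order)}


if __name__ == "__main__":
    for name, offset in frame_offsets().items():
        print(f"{name:14s} kmsp + {offset}")
```

Under that assumption the saved CHC value lies 12 bytes above kmsp, which is
the offset the page-in code uses after allowing for the registers it has pushed
below the frame.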


6.4.3  Instruction  Access  TLB  Miss 

Instruction access TLB misses are dealt with in the same way as data access
misses. Only the signal number is different, and the faulting address is contained in
special register PC1 rather than CHA. Register PC1 contains the address of the
instruction in execute at the time of the failing address translation. Since cache
pages contain both instructions and data, the same set of PTEs applies to data and
instruction address values. Via the interrupt vector table, the User mode instruction
access trap number 8 causes execution to continue at address label UITLBmiss.

UITLBmiss:
        mfsr    it0, pc1
        const   kt1, SIG_ICMISS         ; signal number
        srl     it2, it0, PGSIZE        ; select cache set
        and     it2, it2, (1<<C_SETS)-1
        sll     it2, it2, 3+C_COLUMNS   ; PTE set offset
        jmp     scan_columns
        add     it2, it2, upte          ; adds of 1st PTE

6.4.4 Data Write TLB Protection


The following signal handler code is responsible for moving pages from secondary
DRAM to SRAM cache memory (paging-in). When pages are first paged-in
they are given read and execute permissions only, unless the initial faulting access is
due to a data write. At some time later during program execution, a write to the
cached page may occur. When this happens, a data write protection trap is taken, and
execution is vectored to address label tlb_data_prot.

In the same way as for a data TLB miss, the associated PTE entries are scanned to
find the matching entry. There must be a matching entry and, in addition, a cached
TLB entry which is disallowing write access. Once the PTE has been found, the CHA
address value is again used to find the associated TLB entry. Note, the LRU register
cannot be used because it is only updated on TLB misses. To find the TLB entry, the
VTAG portion of the CHA address is compared with the only two possible TLB
entries associated with the set.

;A write request to a read-only page has occurred.
tlb_data_prot:
        mfsr    it0, cha
        const   kt1, SIG_PROTECT        ; signal
        srl     it2, it0, PGSIZE        ; select cache line
        and     it2, it2, (1<<C_SETS)-1
        sll     it2, it2, 3+C_COLUMNS   ; PTE set offset
        add     it2, it2, upte          ; adds of 1st PTE

scan:
        srl     it0, it0, PGSIZE+5      ; adds VTAG
        sll     it0, it0, PGSIZE+5
        const   kt0, (1<<C_COLUMNS)-1
nxt_column:
        jmpt    kt0, not_cached
        sub     kt0, kt0, 1             ; dec column count
        load    0, 0, it1, it2          ; load word 0
        add     it2, it2, 8             ; next PTE entry
        sll     it3, it1, 31-14         ; test VE-bit
        jmpf    it3, nxt_column
        srl     it3, it1, PGSIZE+5      ; mask PTE VTAG
        sll     it3, it3, PGSIZE+5
        cpeq    it3, it0, it3           ; compare VTAG
        jmpf    it3, nxt_column
        ;
        mfsr    it3, cha                ; find TLB entry
        srl     it3, it3, PGSIZE-1      ; get TLB set
        and     it3, it3, 0x3e
        mfsr    kt0, cha
        srl     kt0, kt0, PGSIZE+5      ; form adds VTAG
        sll     kt0, kt0, PGSIZE+5
        mftlb   it0, it3                ; read Word 0
        srl     it0, it0, PGSIZE+5      ; form TLB VTAG
        sll     it0, it0, PGSIZE+5
        cpeq    it0, it0, kt0
        jmpt    it0, entry_found
        sub     it2, it2, 8             ; PTE adds word 0
        add     it3, it3, 64            ; Word 0 in set 1
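
The way the listing locates the TLB entry can be sketched in Python. The model
assumes the TLB is visible as a list of 128 register values, with word 0 of
each entry at an even register number and the second entry of a set 64
registers above the first, as the constants 0x3e and 64 in the listing imply.
The function names are hypothetical.

```python
PGSIZE = 10   # log2 of the page size used by the example code


def vtag(word, pgsize=PGSIZE):
    """Clear the low bits, as the srl/sll pair by PGSIZE+5 does."""
    return (word >> (pgsize + 5)) << (pgsize + 5)


def tlb_word0_registers(addr, pgsize=PGSIZE):
    """TLB register numbers of word 0 for the two entries of addr's set."""
    first = (addr >> (pgsize - 1)) & 0x3E    # set number times two
    return first, first + 64


def find_tlb_word0(tlb, addr):
    """Pick the entry whose VTAG matches addr.

    Like the listing, the second entry is assumed when the first does
    not match, because a protection trap implies one of them is in use.
    """
    first, second = tlb_word0_registers(addr)
    if vtag(tlb[first]) == vtag(addr):
        return first
    return second
```

Word 1 of the selected entry is held in the next register, which is why the
handlers add 1 to the register number after writing word 0.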

Once the PTE and TLB entries have been found, execution continues at label
entry_found. Both entries must now be updated to set the UW bit enabling User
mode write access. In addition, the PGM[1] bit used to keep a record of any data
writes to the SRAM page is also set. This bit, known as the dirty-bit, will be used in
the page-out algorithm. Once the TLB register reporting the access permission fault
has been updated, an IRET instruction is used to restart the program using the
contents of the still frozen special registers.

entry_found:
        ;Word 0 in it1, it2 points to PTE word 0
        const   kt1, 0x200              ; UW-bit
        or      it1, it1, kt1
        store   0, 0, it1, it2          ; store new word 0
        mttlb   it3, it1                ; assign Word 0
        add     it2, it2, 4
        load    0, 0, it0, it2          ; load word 1
        add     it3, it3, 1
        or      it0, it0, 0x80          ; set PGM[1] dirty
        store   0, 0, it0, it2          ; store new word 1
        and     it0, it0, 0xc1          ; mask out RPN
        sub     it1, it2, upte          ; set offset
        srl     it1, it1, 3             ; set index
        sll     it1, it1, PGSIZE        ; cache page offset
        add     it1, it1, upte          ; cache RPN
        or      it0, it1, it0           ; or in cache RPN
        mttlb   it3, it0                ; assign Word 1
        iret
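
The bit manipulation in entry_found amounts to two OR operations. The Python
sketch below uses the bit values that appear in the listings (0x200 for UW,
0x80 for PGM[1], 0x40 for PGM[0]); the constant and function names are
hypothetical and chosen only for this illustration.

```python
UW_BIT = 0x200       # User mode write permission, word 0
PGM_DIRTY = 0x80     # PGM[1], used by the example as the dirty bit, word 1
PGM_LOCKED = 0x40    # PGM[0], used by the example as the lock bit, word 1


def grant_write(word0, word1):
    """Return updated (word0, word1) after a write to a read-only page."""
    return word0 | UW_BIT, word1 | PGM_DIRTY


def is_dirty(word1):
    """True when the page has been written since it was paged-in."""
    return bool(word1 & PGM_DIRTY)


def is_locked(word1):
    """True when the page must not be paged-out."""
    return bool(word1 & PGM_LOCKED)
```

In the trapware the same update is applied twice, once to the PTE in memory and
once to the TLB entry, with the TLB copy of word 1 receiving the SRAM page
number in place of the DRAM one.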

6.4.5  Supervisor  TLB  Signal  Handler 

When trapware code is unable to complete the necessary TLB update, for example,
if the corresponding address is for a page not currently in the cache, the operating
system receives a signal and information on its memory stack required to continue
the TLB update process. An IRET instruction is used to trampoline to the signal
handler address tlb_sig_handler. The IRET does not cause the faulting User mode
instruction to restart, because after the frozen special registers are saved on the stack,
the PC registers are loaded with the address of the signal handler. Additionally, the
OPS status register is modified to cause Supervisor mode with non-translated
addressing to commence after the IRET, rather than the interrupted User mode with
address translation on.

A small number of support registers were required to support the trapware
routines. The higher level signal handler code requires registers for its own
operation. It is undesirable to use some of the remaining operating system registers
in the gr64-gr95 range to support this code. Global registers are a scarce resource
and likely needed by other critical operating system tasks. The registers used by the
trap handlers (it0-it3) are by convention used by all Freeze mode handlers, since
during Freeze mode interrupts are disabled and therefore there are no register access
conflicts. However, the signal handler code runs with interrupts turned on. An
interrupt occurring during signal processing would likely use the interrupt temporary
registers (it0-it3), and therefore the signal handler must acquire additional registers
for its operation. It does this by pushing some of the User mode assigned global
registers (gr96-gr127) onto the Supervisor stack, just below the signal frame.


;Try and find an empty PTE entry in the column.
;Register tav has the offending address.

tlb_sig_handler:
        push    kmsp, gr96              ; get some registers
        push    kmsp, gr97
        push    kmsp, gr98
        push    kmsp, gr99
        push    kmsp, gr100
        mfsr    gr96, tmc               ; get random value
        srl     gr98, tav, PGSIZE       ; select cache set
        and     gr98, gr98, (1<<C_SETS)-1
        sll     gr98, gr98, 3+C_COLUMNS ; PTE set offset
        add     gr98, gr98, upte        ; PTE column 0 address

        const   gr100, (1<<C_COLUMNS)-1
column_loop:
        jmpt    gr100, page_out
        and     gr96, gr96, ((1<<C_COLUMNS)-1)<<3
        add     gr99, gr98, gr96        ; column wrap-around
        load    0, 0, gr97, gr99        ; load word 0
        add     gr96, gr96, 8           ; next PTE entry
        sll     gr99, gr97, 31-14       ; test VE-bit
        jmpt    gr99, column_loop
        sub     gr100, gr100, 1         ; dec column count

        sub     gr96, gr96, 8
        call    gr100, store_locals     ; destroys gr96
        add     gr98, gr98, gr96        ; PTE adds of Word 0
page_in:
        ;Page-in code follows.


The four PTE entries associated with the set are then scanned to find an unused
entry (i.e., the VE bit is not set). If all PTEs are marked valid, then execution
continues at page_out. Once an empty entry is found, a call to routine store_locals is
made. This call causes all 128 local registers within the Am29000 processor to be
copied onto the Supervisor memory stack just below the user’s saved global
registers. Note, when the set of four PTEs is scanned, a random column in the set is
initially selected. This may initially reduce column scan times. After the local
registers have been made available for signal handler use, execution continues at
label page_in.
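
The random starting column and the wrap-around within the set can be
illustrated with a small Python model. It is a sketch only: memory is a
dictionary of word addresses, random_value stands in for the value read from
the TMC register, and the function name is hypothetical.

```python
C_COLUMNS = 2                                # log2 of the columns per set
VE_BIT = 1 << 14                             # valid bit in PTE word 0
COLUMN_MASK = ((1 << C_COLUMNS) - 1) << 3    # byte offsets 0, 8, 16, 24


def find_free_pte(memory, set_base, random_value):
    """Scan every column once, starting at a pseudo-random column.

    Returns the address of the first PTE whose VE bit is clear, or None
    when all columns are in use (the page_out case).
    """
    offset = random_value
    for _ in range(1 << C_COLUMNS):
        offset &= COLUMN_MASK                # wrap around within the set
        pte_addr = set_base + offset
        if not memory.get(pte_addr, 0) & VE_BIT:
            return pte_addr
        offset += 8                          # next PTE entry
    return None
```

Masking the running offset on every pass, rather than only once, is what lets
the scan start in the middle of a set and still visit each column exactly once.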

6.4.6  Copying  a  Page  into  the  Cache 

Once a PTE for the in-coming page has been selected, the corresponding SRAM
cache page can be easily determined with a little address-based calculation. Words 0
and 1 for the TLB entry are now formed and stored in the TLB selected by the LRU
register. The TLB entry is also copied to the PTE location, with the one difference
that PTEs have the DRAM page number in the RPN field rather than the SRAM page
number.

The Dirty bit, PGM[1], is cleared and the page is marked for read and execute
permissions, unless the signal is from a failing data write access; in this case, the page
is marked dirty and write permission is granted. To determine if a write access failed,
the channel control register CHC is checked for a valid data write access in progress.
The CHC register is obtained by referencing the signal frame stored on the
Supervisor memory stack. Fortunately, the LRU register did not need to be saved on
the memory stack, because the LRU will remain unchanged during signal code
execution. The LRU register is only updated when an address translation fails; this
cannot happen when the operating system is running in physical address mode.

The DRAM page is copied into SRAM memory in bursts of 128 words. Bursting
is repeated several times depending on page size. Using long data bursts to transfer
data is most efficient. The LOADM and STOREM instructions remain in execute
until all their data has been transferred, which is only dependent on the access delay
of the memory. Once the SRAM page has been filled, the user’s local registers are
repaired via a call to load_locals, and a jump to ret_user starts the process of
restoring the processor to its state at the time of the trap.
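
The burst arithmetic can be checked with a short Python model: a page of
1<<PGSIZE bytes is moved in bursts of 128 words (512 bytes), so a 1K byte page
needs two bursts. The sketch below is illustrative only; memory is a dictionary
of word addresses, the list named block plays the role of the local register
file, and the function name is hypothetical.

```python
PGSIZE = 10                      # log2 of the page size in bytes
BURST_WORDS = 128                # one burst fills the 128 local registers
BURST_BYTES = BURST_WORDS * 4


def copy_page(memory, src, dst, pgsize=PGSIZE):
    """Copy one page from src to dst in 128-word bursts.

    Returns the number of bursts performed.
    """
    bursts = (1 << pgsize) // BURST_BYTES
    for _ in range(bursts):
        # LOADM: read a block of words into the register file.
        block = [memory.get(src + 4 * i, 0) for i in range(BURST_WORDS)]
        # STOREM: write the block out to the destination page.
        for i, word in enumerate(block):
            memory[dst + 4 * i] = word
        src += BURST_BYTES       # advance both pointers by one burst
        dst += BURST_BYTES
    return bursts
```

The same loop serves for paging-out if the source and destination are
exchanged.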


page_in:
        srl     gr96, tav, PGSIZE+5     ; form VTAG
        sll     gr96, gr96, PGSIZE+5
        mfsr    gr97, mmu               ; get TID
        and     gr97, gr97, 0xff
        or      gr96, gr96, gr97        ; or in TID
        ;
        const   gr100, 0x00             ; PGM[1]=0 clean
        const   gr97, 512 + 5*4 + CTX_CHC
        add     gr97, kmsp, gr97
        load    0, 0, gr97, gr97        ; get chc
        mtsrim  fc, 31-0
        extract gr97, gr97, gr97        ; rotate
        jmpf    gr97, i_page            ; test CV-bit
        const   gr99, 0x4500            ; VE|UR|UE
        mtsrim  fc, 1+31-15
        extract gr97, gr97, gr97        ; rotate LS-bit
        jmpt    gr97, i_page            ; jump for
        nop                             ; data load
        const   gr99, 0x4700            ; VE|UR|UW|UE
        const   gr100, 0x80             ; PGM[1]=1 dirty
i_page:
        or      gr97, gr96, gr99        ; or in permissions
        store   0, 0, gr97, gr98        ; store Word 0
        mfsr    gr96, lru
        mttlb   gr96, gr97              ; assign TLB word 0
        ;
        add     gr96, gr96, 1
        add     gr98, gr98, 4           ; PTE adds Word 1
        srl     gr97, tav, PGSIZE
        sll     gr97, gr97, PGSIZE
        or      gr97, gr97, gr100       ; assign PGM[1]
        store   0, 0, gr97, gr98        ; store Word 1
        sub     gr99, gr98, upte        ; set offset
        srl     gr99, gr99, 3           ; set index
        sll     gr99, gr99, PGSIZE      ; cache page offset
        add     gr99, gr99, upte        ; cache RPN
        mttlb   gr96, gr99              ; assign TLB word 1
        ;
        mtsrim  cr, 128-1
        const   gr96, (1<<PGSIZE)/512   ; burst count
        sub     gr96, gr96, 2
        const   gr100, 512
        srl     gr97, tav, PGSIZE       ; get page address
        sll     gr97, gr97, PGSIZE
more_in:
        loadm   0, 0, lr0, gr97         ; read in a block
        storem  0, 0, lr0, gr99         ; copy out a block
        add     gr97, gr97, gr100       ; advance pointer
        jmpfdec gr96, more_in
        add     gr99, gr99, gr100       ; advance pointer
        ;
        call    gr100, load_locals      ; destroys gr96
        nop
        jmp     ret_user
        nop

6.4.7  Copying  a  Page  Out  of  the  Cache 

If  a  TLB  miss  occurs  and  all  PTE  entries  for  the  associated  set  are  marked  valid, 
then  a  PTE  must  be  selected  and  the  corresponding  SRAM  page  copied  back  to 
DRAM.  This  makes  room  for  the  page  containing  the  miss  addresses  to  be  copied 
into  the  space  made  available  by  the  out-going  page.  The  PTEs  for  the  set  are 
scanned  and  if  a  non-dirty  page  is  found,  it  is  selected  for  paging-out.  If  all  pages  are 
marked  dirty,  then  a  jump  to  label  all_dirty  is  taken,  as  further  column  scanning  is 
required  to  determine  if  a  page  can  be  paged-out. 

;All columns are in use. Select a column which is not locked and
;not dirty for paging out.
;Register gr98 points to a random column in current set.




page_out:
        call    gr100, store_locals     ; destroys gr96
        add     gr98, gr98, 4           ; pnts to PTE word 1
        mfsr    gr96, tmc               ; get random number
        const   gr100, (1<<C_COLUMNS)-1 ; column counter

dirty_loop:
        jmpt    gr100, all_dirty
        and     gr96, gr96, ((1<<C_COLUMNS)-1)<<3
        add     gr99, gr98, gr96        ; column wrap
        load    0, 0, gr97, gr99        ; load PTE word 1
        add     gr96, gr96, 8           ; next TLBT entry
        sll     gr99, gr97, 31-7        ; test PGM[1] dirty
        jmpt    gr99, dirty_loop
        sll     gr99, gr97, 31-6        ; test PGM[0] locked
        jmpt    gr99, dirty_loop
        sub     gr100, gr100, 1         ; dec column count


Once a PTE for the out-going page is selected, the two TLB entries for the
associated set must be checked to determine if they are caching an entry for the
selected PTE. If there is a valid TLB entry, then it must be marked invalid, as the
associated SRAM page is about to be assigned to a different virtual page address.


page_selected:
;Must first page-out selected cache page before filling the cache
;with the new selected page.
        sub     gr96, gr96, 8
        add     gr98, gr98, gr96        ; adds of PTE Word 1
        ;
;Invalidate any processor TLB entries for the outgoing page.
;Could check VE bit in each TLB entry first.
        srl     gr96, gr97, PGSIZE+5    ; form VTAG
        sll     gr96, gr96, PGSIZE+5
        srl     gr100, gr97, PGSIZE-1   ; get TLB set
        and     gr100, gr100, 0x3e
        mftlb   gr99, gr100             ; read Word 0
        srl     gr99, gr99, PGSIZE+5    ; form VTAG
        sll     gr99, gr99, PGSIZE+5
        cpeq    gr99, gr99, gr96
        jmpf    gr99, test_column_1
invalidate_tlb:
        const   gr99, 0                 ; clear TLB VE-bit
        jmp     tlb_clear
        mttlb   gr100, gr99
test_column_1:
        add     gr100, gr100, 64        ; Word 0 in column 1
        mftlb   gr99, gr100
        srl     gr99, gr99, PGSIZE+5    ; form VTAG
        sll     gr99, gr99, PGSIZE+5
        cpeq    gr99, gr99, gr96
        jmpt    gr99, invalidate_tlb
        nop

It is during the page-out routine that the maintenance of a dirty-bit pays back its
dividend. If the page is not dirty then there is no need to copy it back to DRAM,
because the DRAM copy is exactly the same as the SRAM copy. If no writes have
occurred to the page then the copy-out is avoided.


tlb_clear:
        sll     gr96, gr97, 31-7        ; test dirty bit
        jmpf    gr96, page_in
        sub     gr98, gr98, 4           ; gr98 pnts. word 0
        srl     gr97, gr97, PGSIZE      ; secondary mem RPN
        sll     gr97, gr97, PGSIZE
        sub     gr99, gr98, upte        ; set offset
        srl     gr99, gr99, 3           ; set index
        sll     gr99, gr99, PGSIZE      ; cache page offset
        add     gr99, gr99, upte        ; cache RPN

        mtsrim  cr, 128-1
        const   gr96, (1<<PGSIZE)/512   ; burst count
        sub     gr96, gr96, 2
        const   gr100, 512


The page-out routine, like the page-in routine, makes use of burst-mode data
copying to greatly speed up the process of moving data.


more_out:
        loadm   0, 0, lr0, gr99         ; read in a block
        storem  0, 0, lr0, gr97         ; copy out a block
        add     gr97, gr97, gr100       ; advance pointer
        jmpfdec gr96, more_out
        add     gr99, gr99, gr100       ; advance pointer
        jmp     page_in                 ; gr98 pnts word 0
        nop


6.4.8  Cache  Set  Locked 

The signal processing software, like the trapware, has its code ordered to deal
with the most frequently occurring events first. This results in shorter processing
times. There is no need to burden the simpler tasks with overheads supporting the
operation of less frequently occurring events. However, this does lead to some
repetition in code for the most infrequent signal processing events. For example, if a
page must be copied-out and all the pages are marked dirty, then the PTEs in the set
must be scanned again to find an unlocked page. The selected page is then paged-out.

;All pages are dirty, page-out a non locked page
all_dirty:
        const   gr100, (1<<C_COLUMNS)-1 ; column counter
lock_loop:
        jmpt    gr100, cache_locked
        and     gr96, gr96, ((1<<C_COLUMNS)-1)<<3
        add     gr99, gr98, gr96        ; column wrap
        load    0, 0, gr97, gr99        ; load word 1
        add     gr96, gr96, 8           ; next PTE entry
        sll     gr99, gr97, 31-6        ; test PGM[0] lock
        jmpt    gr99, lock_loop
        sub     gr100, gr100, 1         ; dec column count
        jmp     page_selected
        nop


If all pages associated with the current set are marked locked, then the signal
handler arranges to have the DRAM page mapped directly to the faulting virtual
address. This reduces the access times for all data and instructions contained in the
page. The algorithm does not try to restore the page to SRAM at a later date.
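
Taken together, the page-out sections describe a three-level selection policy:
prefer a page that is neither dirty nor locked, otherwise take any unlocked
page, and fall back to mapping DRAM directly when every column is locked. The
Python sketch below models only that policy. For simplicity it scans from
column 0 rather than from a random column as the handler does, and the names
are hypothetical.

```python
PGM_DIRTY = 0x80     # PGM[1] in PTE word 1
PGM_LOCKED = 0x40    # PGM[0] in PTE word 1


def select_victim(word1_list):
    """Choose a column to page-out from the word 1 values of one set.

    Returns (column, needs_copy_out), or None when every page in the
    set is locked and the DRAM page must be mapped directly instead.
    """
    # First pass: a clean, unlocked page can simply be overwritten.
    for column, word1 in enumerate(word1_list):
        if not word1 & PGM_DIRTY and not word1 & PGM_LOCKED:
            return column, False
    # Second pass: any unlocked page, copied back to DRAM if dirty.
    for column, word1 in enumerate(word1_list):
        if not word1 & PGM_LOCKED:
            return column, bool(word1 & PGM_DIRTY)
    return None
```

For example, a set whose word 1 values carry the flags dirty, locked, clean and
dirty yields column 2 with no copy-out required.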

;All columns for the current set are locked.
;Map the virtual address to non-cache secondary memory.
cache_locked:
        srl     gr96,tav,PGSIZE+5       ;form VTAG
        sll     gr96,gr96,PGSIZE+5
        mfsr    gr97,mmu                ;get TID
        and     gr97,gr97,0xff
        or      gr96,gr96,gr97          ;or in TID
        const   gr97,0x4700             ;VE|UR|UW|UE
        or      gr97,gr96,gr97          ;or in permiss
        mfsr    gr98,lru
        mttlb   gr98,gr97               ;assign Word 0
        add     gr98,gr98,1
        srl     gr96,tav,PGSIZE         ;form RPN
        sll     gr96,gr96,PGSIZE
        mttlb   gr98,gr96               ;assign Word 1
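
The arithmetic used to form the two TLB entry words can be checked with a small
C model. This is a sketch under stated assumptions: PGSIZE is taken to be the
log2 of the page size (10, for 1K byte pages), the faulting address is assumed
to be the value held in tav, and the type and function names are hypothetical.
The permission constant 0x4700 is the one used in the code above.

```c
#include <stdint.h>

#define PGSIZE    10u        /* assumed: log2 of the page size (1K byte pages) */
#define TLB_PERMS 0x4700u    /* VE | UR | UW | UE, as in the handler */

typedef struct {
    uint32_t word0;          /* VTAG, permission bits and TID */
    uint32_t word1;          /* RPN: page is mapped one-to-one */
} tlb_entry;

/* Build a TLB entry mapping the faulting page directly to secondary memory. */
tlb_entry map_direct(uint32_t fault_addr, uint32_t mmu_reg)
{
    tlb_entry e;
    uint32_t vtag = (fault_addr >> (PGSIZE + 5)) << (PGSIZE + 5); /* srl, sll */
    uint32_t tid  = mmu_reg & 0xffu;                              /* get TID  */

    e.word0 = vtag | tid | TLB_PERMS;
    e.word1 = (fault_addr >> PGSIZE) << PGSIZE;                   /* form RPN */
    return e;
}
```

Clearing the low-order bits with a right shift followed by a left shift avoids
the need to build a 32-bit mask constant, which would cost an extra instruction
in the assembly version.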

6.4.9  Returning  from  Signal  Handler 

When the signal handler has completed its processing, the context of the processor
at the time of the original TLB trap must be restored and execution continued.
First, the user’s global registers, temporarily made use of by the operating system,
must be restored from the Supervisor memory stack. Interrupts must be disabled and
the processor state frozen while the special support registers are restored from the
signal frame. Once this has been accomplished and the memory stack is restored to its
pre-trap value, an IRET instruction is used to restart the instruction in execute at
the time the translation trap was taken.


;Pop registers of supervisor mode stack and
;return to program causing the TLB miss.
ret_user:
        pop     gr100,kmsp
        pop     gr99,kmsp
        pop     gr98,kmsp
        pop     gr97,kmsp
        pop     gr96,kmsp
        mtsrim  cps,0x73                ;disable interrupts
        pop     tav,kmsp                ;restore tav
        mtsrim  cps,0x473               ;turn on FREEZE
        popsr   ops,it0,kmsp
        popsr   alu,it0,kmsp
        popsr   chc,it0,kmsp
        popsr   chd,it0,kmsp
        popsr   cha,it0,kmsp
        popsr   pc2,it0,kmsp
        popsr   pc1,it0,kmsp
        popsr   pc0,it0,kmsp
        pop     rab,kmsp                ;pop rab
        pop     it1,kmsp                ;pop rsp
        add     gr1,it1,0               ;alu operation
        add     kmsp,kmsp,4             ;discount signal
        iret

6.4.10  Support  Routines 

The example code used two support routines to copy the 128 32-bit local registers
to and from the Supervisor memory stack. Most operating systems assign all of
the local registers for use by the user’s application code. The large number of
registers effectively implements a data cache. The advantage to having several
registers is that, unlike data memory, the register file supports simultaneous read
and write access. In order to support maximum length data bursts on page transfers,
the register file is made available to the signal processing routine.

;Push local registers onto Supervisor M-stack
store_locals:
        const   gr96,512                ;Window Size
        sub     kmsp,kmsp,gr96
        mtsrim  cr,128-1                ;save 128 registers
        jmpi    gr100                   ;return
         storem 0,0,lr0,kmsp

;Pop local registers off Supervisor M-stack
load_locals:
        const   gr96,512                ;Window Size
        mtsrim  cr,128-1                ;load 128 registers
        loadm   0,0,lr0,kmsp
        jmpi    gr100                   ;return
         add    kmsp,kmsp,gr96
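
For readers more familiar with C, the effect of the two routines on the memory
stack can be sketched as shown below. This is an illustration only: the stack is
modelled as an array of 32-bit words which grows towards lower addresses, the
stack pointer is passed and returned explicitly rather than held in kmsp, and
memcpy() stands in for the burst-mode storem and loadm instructions.

```c
#include <stdint.h>
#include <string.h>

#define NUM_LOCALS 128       /* 128 registers x 4 bytes = 512 byte window */

/* Push: lower the stack pointer by one window, then copy lr0-lr127 out. */
uint32_t *store_locals(uint32_t *sp, const uint32_t lr[NUM_LOCALS])
{
    sp -= NUM_LOCALS;                               /* sub kmsp,kmsp,512 */
    memcpy(sp, lr, NUM_LOCALS * sizeof(uint32_t));  /* storem            */
    return sp;
}

/* Pop: copy the window back into the registers, then raise the pointer. */
uint32_t *load_locals(uint32_t *sp, uint32_t lr[NUM_LOCALS])
{
    memcpy(lr, sp, NUM_LOCALS * sizeof(uint32_t));  /* loadm             */
    return sp + NUM_LOCALS;                         /* add kmsp,kmsp,512 */
}
```

A call to store_locals() followed by load_locals() leaves the stack pointer at
its original value, which is the property the signal handler relies on.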
6.4.11  Performance  Gain


The benefits of using a software controlled cache to take advantage of limited
SRAM availability should be seen in reduced average number of cycles per application
instruction. Ideally the cache performance should approach that of a single large
SRAM memory system. However, the cost of TLB and cache maintenance is not
insignificant, especially when small page sizes are used. Figure 6-16 and Figure 6-17
show the effective average cycle times per instruction observed for a 128 page cache
system. The cache memory was 2-1 and the secondary memory 4-1.

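[Figure 6-16: plot of effective number of cycles per instruction (0 to 3) against
page size (1K, 2K, 4K and 8K bytes) for the assembler utility. Legend: 4-1 DRAM
joint I/D memory system; 4-1 DRAM with 128 page 2-1 cache.]

Figure 6-16. Cache Performance Gains with the Assembly Utility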

Compare results for the smallest cache system of 128 1K byte cache pages. The
effective performance is more divergent from the maximum achievable SRAM
performance with this cache size. When the page size is 2K bytes or greater, the cache
overhead reduces noticeably. With a DRAM-only system, an 8K byte page size
would be selected to reduce TLB handler support overheads. This means the 128K
byte cache model should really be compared with the 8K byte DRAM-only model. In
this case, the cache achieved an average performance gain of 28% for the two utility
programs tested.

Using a cache has some additional benefits for embedded systems. Often
initialization code and data are placed in EPROM, which can be slow to access. When the
EPROM is accessed, the associated page would be automatically copied to SRAM.
Additionally, application read/write data which is not located in an uninitialized data
(BSS) section, and therefore requires initialization before the application program
commences, would be automatically initialized from the EPROM data pages. This
will remove the burden from the operating system routines responsible for application
environment initialization.


[Figure 6-17: plot of effective number of cycles per instruction against page size
(1K, 2K, 4K and 8K bytes) for the nroff utility. Legend: 4-1 DRAM joint I/D memory
system.]

Figure 6-17. Cache Performance Gains with NROFF Utility


The  software  controlled  cache  benefits  become  larger  when  the  secondary 
DRAM  memory  becomes  relatively  slower.  Figure  6-18  shows  a  comparison  of  a 
4-2  DRAM  system  with  a  128  1Kbyte  page  SRAM.  The  benchmark  programs  show 
an  average  performance  gain  of  39.4%. 
Figure  6-18.  Comparing  Cache  Based  Systems  with  DRAM  Only  Systems

Chapter  7


Software  Debugging 


This  chapter  supports  software  engineers  wishing  to  develop  application  code 
or  operating  system  code  for  execution  on  a  29K  RISC  microprocessor. 

Debugging  tools  which  can  be  used  in  both  a  hardware  and  software  debugging 
role,  such  as  in-circuit  emulators  and  logic  analyzers,  are  not  described;  that  is  left  to 
the  individual  tool  manufacturer.  The  material  presented  concentrates  on  describing 
the  operation  of  inexpensive  tools  based  on  the  MiniMON29K  debug  monitor  and 
Universal  Debug  Interface  (UDI).  Figure  7-1  shows  the  various  tools  used  during 
the  different  stages  of  an  embedded  processor-based  project.  Debug  monitors  are 
typically  used  during  the  initial  processor  evaluation  and  selection  stage,  and  later 
when  software  is  debugged  with  a  working  hardware  system. 

Also described are processor features which were specifically included in the
design for the purpose of debugging. Precisely how these features are configured
to build a debug monitor such as MiniMON29K will not be described. This chapter is
not intended to show how debug tools are constructed, but rather to show how existing
tools can be utilized and to describe their inherent limitations. However, readers
wishing to build their own tools will be able to glean the information required.

7.1  REGISTER  ASSIGNMENT  CONVENTION 

The 29K processor calling convention divides the processor registers into two
groups: those available to the run-time application, and hence used by compiler
generated code, and those reserved for operating system use.

All the 29K processor’s 128 local registers, used to implement the register stack
cache, are allocated to application code use. In addition, 32 (gr96-gr127) of the 64
global registers are assigned to application use, and the remaining group of 32
(gr64-gr95) are for operating system use.

The processor does not assign any particular task to the global registers in the
operating system group. However, over time a convention has evolved among 29K
processor users. The subdivision of global registers gr64-gr95 into subgroups was
described in section 3.3, and is widely adhered to; the methods presented in this
chapter shall continue with the convention.

The subgroups are known as: the interrupt Freeze mode temporaries (given
synonyms it0-it3); the operating system temporaries (kt0-kt11); and the operating
system static support registers (ks0-ks15).

7.2  PROCESSOR  DEBUG  SUPPORT 
7.2.1  Execution  Mode 

The processor is in Supervisor mode whenever the SM bit in the Current
Processor Status register (CPS) is 1. If the SM bit is 0, the processor is executing in User
mode. When operating in User mode the processor cannot access protected
resources or execute privileged instructions.

Generally a processor maintains context information which refers to operating
system status and various user processes. Operating in User mode is a means of
preventing a User mode process from accessing information which belongs to another
task or information that the operating system wishes to keep hidden.

If a User mode task breaks any of the privilege rules described in the processor’s
User Manual, then a protection violation trap is taken. Traps cause the operating
system to regain control of execution. Typically the operating system will then send a
software signal to the User mode process reporting its violation and possibly
stopping its execution. The exact action which takes place is particular to each operating
system implementation.

Besides preventing User mode programs from using processor instructions
which are reserved for operating system use only, an operating system can precisely
control a process’s access to memory and registers. This can be very useful when
debugging User mode software. The following section describes the processor’s memory
management support. The register protection scheme is very simple. Special register
RBP is used to restrict banks of global registers to Supervisor mode access only. Each
bank consists of 16 registers and a 1 in each RBP bit position restricts the
corresponding bank to Supervisor mode access only. Thus, it is normal to set RBP=0x3F, which
allows User mode processes to access global registers gr96 and higher. These are the
only registers which can be affected by compiler generated code. Note, however, that
global registers gr0 and gr1, which perform special support tasks, are affected by
compiler generated code and their access is not restricted by the RBP protection scheme.
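
The effect of an RBP setting can be expressed as a short C predicate. This is a
sketch for illustration, not code taken from any monitor or operating system:
the function name is hypothetical, and it assumes that bit n of RBP guards the
bank holding registers 16n to 16n+15, which is consistent with RBP=0x3F
protecting everything below gr96.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when a User mode access to global register `gr` (0..127)
 * is permitted by the given RBP value.
 */
bool user_may_access_global(uint32_t rbp, unsigned gr)
{
    if (gr <= 1)
        return true;                 /* gr0 and gr1 are not covered by RBP */

    unsigned bank = gr / 16;         /* 16 registers per bank (assumed numbering) */
    return ((rbp >> bank) & 1u) == 0;
}
```

With RBP=0x3F the predicate is false for gr64-gr95, the operating system
registers, and true for gr96-gr127.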
7.2.2  Memory  Access  Protection

A number of the 29K processor family members are equipped with a Translation
Look-Aside Buffer (TLB). It is intended for construction of a Memory Management
Unit (MMU) scheme. A complete description of the TLB operation is given in
Memory Management Unit (Chapter 6).

An  MMU  is  normally  used  to  provide  virtual  memory  support.  However,  it  can 
also  play  an  important  debugging  role,  even  in  embedded  applications.  Note,  this 
function  is  not  intended  to  be  performed  by  the  Region  Mapping  facility  provided  on 
some  family  members.  The  Region  Mapping  facility  does  not  support  the  address 
space  granularity  supported  by  the  TLB  hardware.  In  addition,  Region  Mapping  in 
some  cases  only  allows  address  mapping  to  a  limited  region  of  physical  memory.  For 
example,  on  the  Am29200  microcontroller,  only  the  DRAM  memory  and  not  the 
ROM  memory  can  be  accessed  in  virtual  address  space. 

When  code  is  being  developed,  often  an  erroneous  data  reference  will  occur.  If 
no  memory  is  located  at  the  particular  address  then  the  target  memory  system  should 
generate  a  hardware  access  error  (such  as  DERR  or  IERR  on  some  family  members). 
However,  address  aliasing  often  results  in  the  access  being  performed  on  some  other 
address  location  for  which  address  decoding  determines  physical  memory  has  been 
assigned. This kind of programming bug can be difficult to detect. Using the TLB,
address access errors can be immediately detected and reported to the operating
system via access protection violations.

The OS-boot operating system, used by many customers, can provide memory
access protection by mapping virtual addresses to physical addresses in a one-to-one
format. This is adequate for many embedded applications where memory paging
does not occur and application programs can be completely located in available
memory. When an access violation occurs OS-boot informs the MiniMON29K
monitor, which reports the violation to the process controlling debugging. The details
of this mechanism are described in later sections.

Whether you intend to use OS-boot or some other operating system, it is likely
you would benefit from using the on-chip TLB hardware to support a more powerful
debug environment, via the detection of invalid memory references.

7.2.3  Trace  Facility 

Using  the  Trace  Facility,  a  program  can  be  executed  one  instruction  at  a  time. 
This  allows  the  execution  of  a  program  to  be  followed  and  the  state  of  the  processor 
to  be  examined  and  modified  after  each  instruction  has  executed. 

The 29K family has a four stage pipeline: Fetch, Decode, Execute and
Write-back. Tracing is enabled by setting the Trace Enable (TE) bit in the CPS register.
When an instruction passes from the execute stage of the pipeline into the write-back
stage, the TE bit is copied into the TP bit. The Trace Pending (TP) bit is also located in
the CPS register, and when it becomes set the processor takes a trace trap. The
Supervisor mode code normally arranges for the vector table entry for the trace trap to
cause the debug monitor to gain control of the processor.

The debug monitor, normally MiniMON29K, uses the IRET instruction to
restart program execution after the Trace trap handler has completed. Execution of an
IRET causes the Old Processor Status register (OPS) to be copied into the CPS
register before the next program instruction is executed. The TP bit in the OPS is normally
cleared by the debug monitor before the IRET is executed. If the TE bit in the OPS is
set then tracing of the restarted instruction sequence shall continue after executing
the IRET.

Note,  when  the  disable  all  (DA)  bit  in  the  CPS  register  is  set  the  trace  trap  cannot 
be  taken  unless  the  processor  supports  Monitor  mode  (described  below).  Should  the 
program  being  debugged  issue  an  instruction  such  as  ASNEQ,  it  will  then  take  a  trap 
and  the  DA  bit  will  become  set.  The  OPS  and  CPS  registers  will  have  the  TP  bit  set 
but  a  trace  trap  will  not  be  taken.  This  means  that  Freeze  mode  code  (trap  handlers 
which  execute  with  the  DA  bit  set)  cannot  be  debugged  by  a  software  debug  monitor 
unless  the  processor  supports  Monitor  mode.  Most  members  of  the  29K  processor 
family  do  not  support  Monitor  mode. 
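
The interaction of the TE, TP and DA bits can be modelled in a few lines of C.
This is an illustrative sketch only: the structure and function names are
hypothetical, Monitor mode is ignored, and the status bits are held as booleans
rather than at their real CPS bit positions.

```c
#include <stdbool.h>

typedef struct {
    bool te;    /* Trace Enable  */
    bool tp;    /* Trace Pending */
    bool da;    /* Disable All   */
} status_reg;

/*
 * Model one instruction leaving execute for write-back: TE is copied
 * into TP. Returns true when a trace trap would be taken, which needs
 * TP set and DA clear.
 */
bool retire_instruction(status_reg *cps)
{
    cps->tp = cps->te;
    return cps->tp && !cps->da;
}

/* Model IRET: the OPS register is copied into the CPS register. */
void iret(status_reg *cps, const status_reg *ops)
{
    *cps = *ops;
}
```

A monitor wishing to single step would clear TP and set TE in the OPS image
before calling iret(); the next call to retire_instruction() then reports a
trace trap. With DA set, as in Freeze mode code, TP still becomes set but no
trap is reported.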

7.2.4  Program  Counter  register  PC2 

The instruction following a branch instruction, known as the delay instruction,
is executed regardless of the outcome of the branch. This performance improving
technique requires that two registers be used to record the addresses of the
instructions currently in the execute and decode stages of the processor pipeline. When a
branch is taken the PC0 register contains the address of the target instruction as it
enters the decode stage of the pipeline. Register PC1 always contains the address of the
instruction in execute. When the target instruction of a branch enters decode the
instruction in execute is the delay slot instruction following the branch.

Program counter registers PC0 and PC1 are required to restart the processor
pipeline in the event of a trap or an interrupt occurring. Many of the synchronous
traps, such as a register access privilege violation, cause execution to be stopped with
the address of the instruction causing the violation held in PC1 (execute address).
Asynchronous traps, such as an external interrupt, and instruction traps, such as
ASSERT instructions, cause the address of the instruction following the one in execute
at the time of the interrupt to be held in the PC1 register. In fact when a trap or
interrupt is taken the PC register values are frozen and used to restart program execution
later. The frozen PC values are held in a 3 register PC-buffer. Of course, the actual PC
registers continue to be used. Instructions such as MFSR and MTSR (move-from and
move-to special register) can be used to read and modify the PC-buffer register values.

The address of the instruction previously in execute and now in write-back is
held in the PC2 register. This is very convenient because a debugger can determine
the instruction which was in execute at the time the interrupt or trap occurred. The
trace trap is an asynchronous trap, and thus after the trap is taken the next instruction
about to execute is addressed by PC1. Some family members support Instruction
Breakpoint registers, which can be used to stop execution when a certain address
reaches execute. When this occurs a synchronous trap is taken and the instruction is
stopped before execution is completed.

Debug monitors, such as MiniMON29K, understand the operation of the PC
registers and can use them to control program execution. When MiniMON29K is
used with a processor which has no Breakpoint registers, a technique relying on
temporarily replacing instructions with illegal opcode instructions is used to implement
breakpoints. Illegal opcode instructions are used in preference to trap instructions
because execution is stopped with the PC-buffer recording execution a cycle earlier.
That is, the breakpoint address is in PC1 rather than PC2, as would happen with a trap
instruction.

One  further  useful  feature  of  the  PC2  register  occurs  when  breakpoints  are  set  to 
the  first  instruction  of  a  new  instruction  sequence  —  typically  the  first  instruction  of  a 
procedure.  When  the  breakpoint  is  taken  and  program  execution  is  stopped,  the  PC2 
register  contains  the  address  of  the  delay  slot  instruction  executed  before  the  new 
instruction  sequence  started.  This  is  very  useful  in  determining  where  a  program  was 
previously  executing. 
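
The relationship between the three PC registers around a taken branch can be
followed with a small C model. This is a simplified sketch, assuming one
instruction advances through the pipeline per cycle with no stalls; the type and
function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t pc0;    /* address of instruction in decode     */
    uint32_t pc1;    /* address of instruction in execute    */
    uint32_t pc2;    /* address of instruction in write-back */
} pc_regs;

/* Advance the pipeline one stage; next_decode enters the decode stage. */
void advance(pc_regs *pc, uint32_t next_decode)
{
    pc->pc2 = pc->pc1;
    pc->pc1 = pc->pc0;
    pc->pc0 = next_decode;
}

/*
 * Start with a taken branch in execute and its delay slot in decode,
 * then run until the branch target reaches execute, which is where a
 * breakpoint set on the target would stop the program.
 */
pc_regs run_to_target(uint32_t branch_addr, uint32_t target)
{
    pc_regs pc = { branch_addr + 4, branch_addr, branch_addr - 4 };

    advance(&pc, target);        /* delay slot in execute, target in decode */
    advance(&pc, target + 4);    /* target in execute                       */
    return pc;
}
```

When the model stops, PC1 holds the target address and PC2 holds the address of
the delay slot instruction, so the location of the branch can be recovered by
subtracting one instruction (4 bytes) from PC2.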

7.2.5  Monitor  Mode 

Monitor Mode currently only applies to a limited number of 29K processors, see
Table 7-1. If a trap occurs when the DA bit in the CPS register is a 1, the processor
starts executing at address 16 in instruction ROM space. Monitor Mode is not entered
as a result of asynchronous events such as timer interrupts or activation of the
TRAP(1-0) or INTR(3-0) lines.

On taking a Monitor Mode trap the Reason Vector register (RSN) is set by the
processor to indicate the cause of the trap. Additionally, the MM bit in the CPS
register is set to 1. When the MM bit is set, the shadow program counters (SPC0, SPC1,
and SPC2) are frozen, in a similar way to the FZ bit freezing the PC0-PC2 buffer
registers. Because the shadow program counters continue to record PC-BUS activity
when the FZ bit is set, they can be used to restart Freeze Mode execution. This is
achieved by an IRET or IRETINV instruction being executed while in Monitor
Mode.

Monitor  mode  traps  are  used  by  monitors  in  the  debugging  of  trap  and  interrupt 
handlers  and  are  not  intended  for  operating  system  use. 

7.2.6  Instruction  Breakpoints 

Some members of the 29K processor family support Instruction Breakpoint
registers, see Table 7-1. These registers can be used to stop a program’s execution when
an instruction at a specified address enters execute. The control mechanism for
Breakpoints is flexible, allowing a User process ID to be specified.

Table 7-1. 29K Family On-chip Debug Support

Processor   Monitor   virtual   instruction   data          data breakpoints
            mode      memory    breakpoints   breakpoints   value ranges
Am29000     no        yes       -             -             -
Am29005     no        no        -             -             -
Am2903x     no        yes       -             -             -
Am29040     no        yes       2             1             no
Am29050     yes       yes       2             -             -
Am2920x     no        no        -             -             -
Am2924x     no        yes       -             -             -
Am29460     yes       yes       2             1             yes

With  3-bus  processors,  breakpoints  can  be  assigned  to  Instruction  space  or 
ROM  space.  Both  of  these  spaces  normally  contain  instructions  but  the  ROM  space 
typically  contains  ROM  rather  than  RAM  memory  devices.  No  matter  which  kind  of 
memory  device  is  utilized  the  Breakpoint  registers  can  be  used. 

When a processor does not support Breakpoint registers, illegal instructions or
traps are used to stop execution at desired address locations. Debug monitors are,
however, unable to manipulate instructions which are located in ROM devices. Thus
the main use of the Breakpoint registers is to support breakpoints when ROM devices
are in use. Additionally, they are used in the rare case where a 3-bus Harvard
architecture memory system is constructed without providing a means for the processor to
read and write instruction space. In this case the processor will not be able to replace
instructions at breakpoint addresses with temporary illegal instructions.

The MiniMON29K debug monitor, described in detail later, must make some
decision about the values to put in the breakpoint register fields: BTE (break on
translation enable) and BPID. The debug tool user (Debugger Front End user)
normally selects the process identifier (PID) of the application process containing the
breakpoint. However, the DFE often does not know if the 29K operating system is
running with address translation turned on. The DebugCore accesses a data structure
shared with the operating system to determine the value for the BTE field, see section
D.3.4. The operating system is required to fill in the appropriate sections of the shared
data structure, informing the DebugCore of the CPS register value to be used during
program execution. When the PI (physical instruction) bit in the CPS register is clear
the BTE bit is set.
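
The rule in the last sentence can be written as a one line C function. This is
a sketch and not DebugCore source: the function name is hypothetical, and the
position of the PI bit (0x20) is an assumption which should be checked against
the User Manual for the processor in use.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPS_PI 0x20u    /* assumed position of the PI bit in the CPS register */

/*
 * Given the CPS value the operating system will run the program with,
 * decide the BTE field: set when instruction addresses are translated,
 * that is, when the PI bit is clear.
 */
bool bte_for_cps(uint32_t cps)
{
    return (cps & CPS_PI) == 0;
}
```

The CPS value passed in would be the one the operating system placed in the
shared data structure, not the value current while the monitor itself is
running.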
7.2.7  Data  Breakpoints

Currently only the Am29040 processor supports Data Breakpoint registers, see
Table 7-1. These registers can be used to stop a program’s execution when data is
either read from or written to an address which lies within a specified range. The control
mechanism is flexible and shares many of the characteristics provided by the
Instruction Breakpoints Control registers described in the previous section.

When an address match is detected, a trace trap is taken after the load or store
instruction is completed (this is also true for loadm and storem instructions). When a
trap is taken, the PC1 register points to the instruction following the load or store and
the data transfer has occurred.

To make effective use of data breakpoints it is important that the selected
debugger has support for controlling the operation of the on-chip support registers. Data
Breakpoint Control registers are a relatively new feature and many debuggers have
not yet been extended to incorporate the necessary command and control functions.

7.3  THE  MiniMON29K  DEBUGGER 

Developers of software for embedded applications are used to working with
emulators. They enable code to be down-loaded to application memory or installed
in substitute overlay memory. This avoids the development delays associated
with running code from EPROM. The use of emulators may be a necessary stage in
first getting the target hardware functional; for this task their ability to work with
partially functioning hardware makes them indispensable. However, once the processor
is able to execute out of target system memory and a communications channel such as
a serial link is available, the need for an emulator is reduced. Emulators are
expensive, and it is not always possible to make one available to each team member. The
use of a debug monitor such as the MiniMON29K monitor during the software debug
stage of a project is an economical alternative to an emulator.

The MiniMON29K monitor is not intended to be a standalone monitor. That is,
it requires the support of a software module known as the Target Interface Process
(TIP). The TIP executes on a separate host processor. The embedded 29K target
processor communicates with the TIP via a serial link or other higher performance
channel (see Figure 7-2). The User-Interface process, known as the Debugger Front End
(DFE), communicates with the TIP via the inter-process communication mechanism
known as UDI, which is described later.

Most monitors do not offer high level language support. Assembly code
instructions must be debugged rather than the original, say C, code. Using GDB in
conjunction with the MiniMON29K monitor enables source level code to be debugged,
which is far more productive and necessary for large software projects. (More on this
in the UDI section).

MiniMON29K has a small memory requirement, for both instruction memory
and data memory of the target 29K system. The size is reduced by implementing
much of the support code in the TIP host machine, and communicating with the target
via high-level messages. The amount of communication required is reduced by
incorporating sophisticated control mechanisms in the target DebugCore.

Much of the following discussion in this section is concerned with describing
the operating principles of the target hardware software components. Other
MiniMON29K components such as MonTIP and MonDFE are described in the later UDI
sections.

7.3.1  The  Target  MiniMON29K  Component 

The  embedded  portion  of  the  MiniMON29K  monitor  must  be  installed  in  target 
system  ROM  or  down-loaded  by  the  host  via  a  shared  memory  interface.  The  target 
application  code  and  additional  operating  system  code  can  then  be  down-loaded  via 
the  message  system.  If  changes  to  the  code  are  required,  then  the  message  system  can 
be  used  to  quickly  down-load  new  code  without  changing  any  ROM  devices. 

The  software  installed  in  the  target  hardware  consists  of  a  number  of  modules, 
described  in  Figure  7-3.  When  the  embedded  Am29000  processor  is  reset,  the  initial 
operating  system  module,  OS,  takes  control.  This  module  initializes  the  processor 
and  the  other  support  modules.  The  monitor  components  are  required  to  implement  a 
message  communications  driver  and  a  debug  control  core  (DebugCore). 

The operating system module is not part of the MiniMON29K monitor. This
allows developers to build their own operating system or make use of a 3rd-party
real-time executive product. However, AMD does supply processor initialization code
and HIF system call support routines. HIF is an embedded system call interface
specification, which many of the 29K processor support library services make use of. The
AMD supplied operating system code is known as OS-boot, and it is normally
supplied in the same ROM containing the MiniMON29K target component software.
(All of the OS-boot and MiniMON29K 29K source code is freely available from
AMD).

Figure 7-3. 29K Target Software Module Configuration

7.3.2  Register  Usage 

The  DebugCore,  message  driver  and  other  MiniMON29K  monitor  modules  do 
not  require  any  processor  registers  to  be  reserved  for  their  use.  This  means  that  all  the 
processor  registers  are  available  for  use  by  the  operating  system  and  application 
code. 

What  this  really  means  is  that  any  registers  temporarily  used  by  MiniMON29K 
code  are  always  restored.  The  only  exception  to  this  occurs  with  global  register  gr4 
and  the  TE  and  TP  bits  of  the  CPS  special  register. 

Global register gr4 is implemented in some members of the 29K family but not
reported in the relevant User Manual, as it is never used by application or operating
system code. With family members which have no gr4 register, the ALU forwarding
logic can be used to keep a temporary register alive for 1 processor cycle following its
modification. The gr4 data is lost during the write-back stage when there is no real
gr4 register in the global register file. Note, software such as the MiniMON29K
DebugCore can be difficult to debug because emulators also make use of gr4 in
analyzing processor registers.


Addendum  to  —  Evaluating  and  Programming  the  29K  RISC  Family 


October  13  1995,  Draft  1 


The TE and TP bits, located in the Current Processor Status (CPS) register, belong
to the MiniMON29K DebugCore. The rest of the CPS register belongs to the operating
system, and the OS should not modify the TE and TP bits, which are maintained by
the DebugCore. When the operating system issues an IRET instruction, the processor
updates the CPS register with the contents of the Old Processor Status (OPS)
register. Normally the DebugCore will set the TE bit in the OPS before the
operating system performs an IRET. Initially, however, the operating system must
call the support routine dbg_iret() to perform the IRET on its behalf. This gives
the DebugCore an opportunity to gain control of the TE bit.
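One way an operating system might honor this ownership rule when it builds a new
status value is to merge its own bits with the bits currently held for the
DebugCore. The following is a minimal, host-testable sketch only. The function
name os_merge_status is hypothetical, and the mask is passed in as a parameter
because the TE and TP bit positions are not given here; they should be taken from
the relevant processor User Manual.

```c
#include <stdint.h>

/*
 * Merge a status value chosen by the operating system with the bits
 * that belong to the DebugCore.  dbg_mask selects the TE and TP bits;
 * its value is an assumption supplied by the caller.
 */
uint32_t os_merge_status(uint32_t current, uint32_t os_value, uint32_t dbg_mask)
{
    /* Keep the DebugCore-owned bits from the current value and take
       every other bit from the value requested by the OS. */
    return (current & dbg_mask) | (os_value & ~dbg_mask);
}
```

With this approach the operating system never needs to know what the DebugCore
has placed in its two bits; it only needs to know which bits to leave alone.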

7.3.3  The  DebugCore 

The TIP host processor controlling the target 29K processor sends messages via
the available link to the DebugCore module. The message system enables the host to
examine and change registers and memory in the target hardware. Program execution
can also be controlled via instruction breakpoints and single-stepping; messages
are provided specifically for controlling processor execution.

The DebugCore decodes the messages, giving access to the 29K processor registers
and the target system memory. However, it does not access the non-processor
resources directly. The Configuration Control Module supports the peek and poke
functions shown below. These functions are used for all non-register space target
hardware access. Note that all functions and data variables defined in the
MiniMON29K configuration module begin with the cfg_ prefix.

void cfg_peek(to, from, count, space, size)
void cfg_poke(to, from, count, space, size)

The  peek  function  is  used  to  read  from  target  space  into  temporary  debug  core 
BSS  memory  space.  The  poke  function  is  used  when  writing  to  target  space.  The 
‘space’  parameter  is  used  to  indicate  the  target  address  space,  according  to  the  re¬ 
ceived  message  parameters.  Typical  space  field  values  would  enable  instruction 
space,  data  space  or  I/O  space  access.  The  ‘size’  field  is  used  to  indicate  the  size,  in 
bytes,  of  the  objects  being  transferred.  The  CFG  module  normally  tries  to  make 
memory  accesses  in  the  size  indicated.  However,  if  a  memory  system  does  not  sup¬ 
port,  say,  byte-write  access  to  ROM-space,  then  the  CFG  access  functions  can  be 
configured  to  perform  byte  manipulation  via  word-sized  memory  accesses.  By  keep¬ 
ing  the  access  functions  separate,  a  user  can  configure  the  peek  and  poke  functions 
for  any  special  requirements  without  having  to  understand  the  DebugCore  module. 
Peek  and  poke  functions  are  supplied  for  typical  existing  target  hardware. 
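The byte manipulation via word-sized accesses mentioned above can be sketched as
a read-modify-write. This is an illustration, not the supplied CFG code: the
function name poke_byte_via_word is hypothetical, the memory region is modeled as
an array of 32-bit words, and big-endian byte ordering within a word is assumed.

```c
#include <stdint.h>

/*
 * Write one byte using only word-sized accesses (read-modify-write).
 * mem is the base of a word-addressable region and offset is a byte
 * offset into it.  Big-endian byte ordering within a word is assumed.
 */
void poke_byte_via_word(volatile uint32_t *mem, uint32_t offset, uint8_t value)
{
    uint32_t index = offset / 4u;                /* word containing the byte */
    uint32_t shift = (3u - (offset % 4u)) * 8u;  /* big-endian byte lane     */
    uint32_t word  = mem[index];                 /* word-sized read          */

    word &= ~((uint32_t)0xFFu << shift);         /* clear the target byte    */
    word |= (uint32_t)value << shift;            /* insert the new byte      */
    mem[index] = word;                           /* word-sized write         */
}
```

A poke routine built this way would loop over the ‘count’ objects and call such
a helper whenever the ‘size’ requested is smaller than the access width the
memory system supports.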

For example, if a system uses Flash memory devices, the erase and programming
sequences required to write to Flash should be built into the cfg_poke() procedure.
If Flash and other device types, such as DRAM, are used in the same memory space,
then the cfg_poke() procedure can examine the address value in the ‘to’ parameter
to determine the correct operation. Recent versions of MiniMON29K have included
CFG module support for Flash memory devices.
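The address test described above might look like the following sketch. The
address map (FLASH_BASE, FLASH_SIZE), the enumeration and the function name are
all hypothetical and chosen only for illustration; a real target would use its
own memory map, and the Flash programming sequence itself is not shown.

```c
#include <stdint.h>

/* Hypothetical address map, for illustration only. */
#define FLASH_BASE 0x00000000u
#define FLASH_SIZE 0x00080000u

typedef enum { DEV_FLASH, DEV_RAM } device_t;

/* Classify a destination address so that a poke routine can choose
   between a Flash programming sequence and a plain store. */
device_t poke_target_device(uint32_t to)
{
    if (to - FLASH_BASE < FLASH_SIZE)   /* unsigned range check */
        return DEV_FLASH;
    return DEV_RAM;
}
```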

When the target processor stops executing operating system or application code,
a context switch occurs into the DebugCore context. The state of the processor is
recorded when switching context, thus enabling execution to be resumed without any
apparent interruption. The DebugCore context may be entered for a number of
reasons, such as: a message was received from the TIP host, an instruction
breakpoint was encountered, or a memory access violation occurred. Whenever the
DebugCore gains control, a ‘halt’ message is sent to the TIP host processor. The
TIP host and target can then exchange messages as necessary to analyze or change
the state of the processor or memory.

DebugCore 2.0 shares a data structure with the operating system (OS). Vector
table entry 71 is initialized by the OS to point to the data structure. Appendix D
describes the DebugCore and OS interface in detail. The data structure is mainly
used to pass the addresses of entry points within the two software modules. Address
labels can be determined at link time. However, when, say, a new OS is loaded at
run time, it must reconnect with the DebugCore, and this requires that address
labels be available at run time. In addition to address labels, the structure
contains various fields which support the DebugCore in installing per-process
breakpoints and requesting OS-supplied service functions.

7.3.4 DebugCore Installation

It is very simple to install the DebugCore with any operating system. Mainly,
what is required is the use of a number of vector table entries which are not
normally needed for operating system operation, together with a call to the
DebugCore initialization routine dbg_control(). Figure 7-4 shows the vector table
entries required. The two most obvious entries are for trap number 0 (illegal
opcode) and number 15 (trace trap). These table entries point to the DebugCore
entry labels dbg_V_bkpt and dbg_V_trace, respectively. Note that all the DBG module
function and data structure names begin with the dbg_ prefix.

OS  Cold-Start 

The operating system is responsible for inserting the necessary address labels
into the vector table. Any vector table entries which are not required by the
operating system can be defaulted to the DebugCore, via the dbg_trap entry. When
this entry is used, register gr64 should contain the original trap number (see
section D.4.1 for alternatives). It can be very useful to direct traps such as
Protection Violation (number 5) and Data TLB Protection Violation (number 13) into
the DebugCore. This is much better than just issuing a HALT instruction in the
operating system. When a trap is taken into the DebugCore, a message is sent to the
MonTIP process, which informs the DFE process that execution has halted. DFEs such
as MonDFE and GDB understand the 29K trap number assignment and can report a trap
number 13 as a User mode data protection violation (a segmentation fault in Unix
terms).
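The vector table set-up described above can be sketched as follows. This is an
illustration under stated assumptions, not the OS-boot source: the function name
os_build_vectors is hypothetical, the table is modeled as an array of 32-bit
handler addresses, a size of 256 entries is assumed, and the small per-trap stub
that would load gr64 with the original trap number before entering dbg_trap is
omitted.

```c
#include <stdint.h>

#define VECTOR_COUNT 256u   /* assumed number of vector table entries */

/* Fill a vector table image: entries the OS does not claim default to
   the DebugCore, then the breakpoint and trace entries are installed. */
void os_build_vectors(uint32_t table[VECTOR_COUNT],
                      uint32_t dbg_trap, uint32_t dbg_V_bkpt,
                      uint32_t dbg_V_trace)
{
    uint32_t i;

    for (i = 0; i < VECTOR_COUNT; i++)
        table[i] = dbg_trap;     /* default: report via the DebugCore */

    table[0]  = dbg_V_bkpt;      /* trap 0: illegal opcode (breakpoint) */
    table[15] = dbg_V_trace;     /* trap 15: trace trap */
}
```

An operating system would follow this with its own assignments for the traps and
interrupts it services itself, leaving entries such as 5 and 13 pointing into
the DebugCore.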

Initializing  the  vector  table  is  part  of  what  is  known  as  the  operating  system 
cold-start  code.  The  operating  system  start-up  sequence  is  shown  in  Figure  7-5. 

When power is applied to the processor, or when the *RESET pin is asserted, the
CPU begins executing instructions at location 0 in ROM instruction space (ROM
space and instruction space are the same in many 29K family members). Control is
usually passed directly to the operating system cold-start code. To save the
contents of all the processor’s registers before the system is initialized, the
user may modify the code in the operating system to jump to the debugger core-dump
facility. Once the registers have been saved, the cold-start code is executed by
passing control to the os_cold_start label in ROM space.

Normally an operating system will begin cold-start code immediately at address
0. However, certain software bugs may cause program execution to become
out of control, and the only way to regain control is to activate the processor
reset pin. This is particularly the case when the TLB registers are not used by the
operating system to implement address access protection. A jump to dbg_coredump at
address 0 enables the processor state to be recorded at the time reset was
asserted. By examining the PC and channel special registers, some understanding of
the cause of the loss of proper program execution may be gained. To restart
execution after the core-dump data has been examined, a MiniMON29K RESET message
must be issued by MonTIP. This causes the dbg_trap_num variable to be cleared and
the processor state to be restored to the hardware reset condition before execution
is started at address os_cold_start.

DebugCore 2.0 requires that vector table entry 71 point to a memory region
shared by the DebugCore and the operating system. The operating system must
initialize several fields of the shared data structure (see section D.3). For
DebugCore 1.0 compatibility, the data structure can be initialized to zero. After
the interrupt and trap handler vectors are installed, the cold-start code performs
one-time initialization of target system hardware, then calls msg_init() to
initialize the message system and underlying communication drivers. The precise
action taken by msg_init() is dependent on the communications hardware used to
support message sending.

When the cold-start sequence is complete, a call is made to dbg_control(), which
initializes the DebugCore. The point at which this call into the DebugCore is made
defines the boundary between operating system cold-start and warm-start code. The
parameters passed to the function are shown below:

return_struct dbg_control(
    int  dbg_trap_num,    /* lr2 value */
    int *os_info_p)       /* lr3 value */


It is called just like a C language routine. Local register lr2 contains a copy of
the value held at memory location dbg_trap_num. Register lr3 contains the address
of a data structure which describes the memory layout of the target system. The
operating system is responsible for determining the amount and address range of
available memory. Although this information is passed to the DebugCore, the
DebugCore does not itself require it. It merely keeps a record of the relevant data
structure address so it can pass the information to the DFE process. Debug tool
users interact with the DFE and generally like to know about the target memory
availability. Figure 7-6 shows the layout of the structure passed to the DebugCore.
Note that where a 29K system is based on a single memory space containing both
instructions and data, the d_mem, i_mem and r_mem parameters are the same.

The lr2 parameter tells the DebugCore whether a call to dbg_coredump has already
been performed. Whenever the DebugCore is entered, the variable dbg_trap_num takes
on the number of the trap causing DebugCore invocation; for example, number 15 when
a trace trap occurs. When a core dump has been performed, trap number 255 is
recorded, and when the DebugCore is reentered with this number the state of the
processor is not recorded again.

This is necessary because the call to dbg_control() appears as a built-in
breakpoint. Whenever a breakpoint is taken, the complete state of the processor is
recorded; in effect, a context switch into the DebugCore occurs. The original
context is restored when the DebugCore receives a GO or STEP message from the
MonTIP process. Whenever the DebugCore gains control, a HALT message is sent to
MonTIP. Under DFE direction, MonTIP can then send messages to the DebugCore to
examine and change the saved processor status.

Figure 7-6. Operating System Information Passed to dbg_control()

OS  Warm-Start 

The DebugCore records the return address for dbg_control() when it is first
called. The address is important because it is the start of the operating system
warm-start code. When an application program is down-loaded to the target hardware,
an INIT message is normally sent. The message contains information extracted from
the application COFF file. This information, along with other operating system
run-time support data, is passed to the operating system when the dbg_control()
function returns. As is normal for C procedures, the return information is placed
in global registers starting with gr96. Figure 7-7 shows the format of the
operating system warm-start data.

After the DFE (MonDFE for example) has been instructed to load a new program
into memory, the return registers can be examined to verify their contents. Note
that with some DFEs it is possible to load a COFF file without sending an INIT
message. In this case the return registers are not affected and the PCs are not
forced to the dbg_control() return address.

After loading a program, a user will normally start execution, which causes the
DebugCore to switch out of context and restore the context described in the register
shadow memory. If an INIT message was received, execution will commence in the
operating system warm-start code. Otherwise, it continues from the address held in
the restored PC registers. Warm-start code normally examines the return structure
values and prepares the operating system run-time support accordingly. For example,
register gr100 contains the start address of the down-loaded application program.
The address value may be loaded in the PC-buffer registers before an IRET
instruction is used to start program execution. However, it is important to note
that the warm-start operation is entirely operating system dependent, and the code
need pay no attention to the return structure information. The operation of
OS-boot, normally supplied along with MiniMON29K, is described in a later section.

7.3.5  Advanced  DBG  and  CFG  Module  Features 

Normally the call to dbg_control() implies that a built-in breakpoint should be
taken. This gives the user an opportunity to down-load an application program
before execution is continued. However, by setting the call lr2 parameter to
V_NOBRK (254), no breakpoint will be taken and the call will return with no need
for a GO message from MonTIP. This enables the DebugCore to be initialized for
operation, and is useful where there is no requirement to download an application
program. Of course, there are then no call return values for the operating system
warm-start code to examine. The facility enables the DebugCore to remain in a final
system and only be called upon in an emergency, such as a memory access violation.
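The three behaviors selected by the lr2 value (normal built-in breakpoint, entry
after a core dump, and V_NOBRK) can be summarized in a small decision function.
This is a sketch of the behavior as described in the text, not DebugCore source;
the enumeration, the function name and the symbol V_COREDUMPED are hypothetical,
while the values 254 and 255 come from the text.

```c
#define V_NOBRK      254  /* named in the text */
#define V_COREDUMPED 255  /* hypothetical name for the core-dump marker */

typedef enum {
    ENTRY_RECORD_AND_HALT,  /* save state, then wait for GO or STEP */
    ENTRY_HALT_ONLY,        /* state already saved by the core dump  */
    ENTRY_RETURN            /* initialize and return to the caller   */
} dbg_entry_kind;

/* Classify a dbg_control() call by the trap number passed in lr2. */
dbg_entry_kind dbg_classify_entry(int dbg_trap_num)
{
    if (dbg_trap_num == V_NOBRK)
        return ENTRY_RETURN;
    if (dbg_trap_num == V_COREDUMPED)
        return ENTRY_HALT_ONLY;
    return ENTRY_RECORD_AND_HALT;
}
```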

The CFG module is used to configure the operation of the DBG module. There is
really no need to have the source code for the DBG module, only the CFG module.
After the CFG module has been configured, it can be assembled and linked with the
.o debug core modules (dbg_core.o and dbg.o). The CFG module supplies the
cfg_peek() and cfg_poke() functions, as well as defining the number of breakpoints
supported and the size of the DebugCore message send buffer. Note, however, that
there is conditional assembly code in the CFG module for a wide range of target
hardware systems. In practice, configuring CFG normally means defining the correct
symbol value during assembly.

Whenever the DebugCore is entered, the routine cfg_core_enter() is called. This
gives the DebugCore user an opportunity to control the state of the processor
during DebugCore operation. For example, normally the DebugCore runs with the
on-chip timer turned off. This means no timer progress is made and no timer
interrupts will occur while the DebugCore is in context. The timer can be re-enabled
by changing the code in cfg_core_enter(). The supplied code also locks the
processor cache (only on family members which have a cache). This prevents
application and operating system data from being displaced by DebugCore
information.

The DebugCore is mainly written in the C language and makes use of application
space processor registers during its operation. On taking, say, a breakpoint and
entering the DebugCore, all the processor registers are copied to shadow memory
locations. Users examine and change the shadow values before they are returned to
registers when the DebugCore context is exited. It is possible that an external
hardware device could generate an interrupt when the DebugCore is in context
(interrupts may be enabled in the cfg_core_enter() procedure). This could cause
some confusion, as the interrupt handler may wish to modify some operating system
assigned registers to record a change in the interrupting device status. The change
would be lost when the DebugCore exited. To overcome this problem, global registers
gr64-gr95 are not shadowed if memory location dbg_shadow_os contains a 0 (normally
set to -1). This can be done in the cfg_core_enter() procedure.

When dbg_shadow_os is cleared, physical registers gr64-gr95 are always accessed
with MiniMON29K READ and WRITE messages. However, messages such as FIND and COPY
operate on the shadow copies only, and this creates some minor restrictions in
DebugCore operation.

If cfg_core_enter() is modified to enable the on-chip timer to continue
interrupting during DebugCore operation, then memory location dbg_shadow_timer
should also be set to 0 (normally -1). This prevents the TMR and TMC timer special
registers from being restored from their corresponding shadow memory locations when
the DebugCore context is exited.
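The effect of the two shadow control variables on DebugCore exit can be modeled
on a host as shown below. This is a sketch of the described behavior only: the
regs_t structure and the function dbg_restore_on_exit are hypothetical, and only
gr64-gr95 and the two timer registers are modeled.

```c
#include <stdint.h>
#include <string.h>

/* Minimal model of the state affected by the shadow controls. */
typedef struct {
    uint32_t os_gr[32];   /* gr64-gr95, the operating system registers */
    uint32_t tmc, tmr;    /* timer special registers */
} regs_t;

/* Restore registers from shadow memory on DebugCore exit.  A control
   value of 0 means "do not shadow"; the normal value is -1. */
void dbg_restore_on_exit(regs_t *cpu, const regs_t *shadow,
                         int dbg_shadow_os, int dbg_shadow_timer)
{
    if (dbg_shadow_os != 0)
        memcpy(cpu->os_gr, shadow->os_gr, sizeof cpu->os_gr);

    if (dbg_shadow_timer != 0) {
        cpu->tmc = shadow->tmc;
        cpu->tmr = shadow->tmr;
    }
}
```

In the model, a change made to gr64-gr95 by an interrupt handler while the
DebugCore is in context survives the exit only when dbg_shadow_os is 0.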

Interrupts must be enabled during DebugCore operation if, say, an
interrupt-driven UART is being used for MiniMON29K message communication. It is
sometimes possible to use the message system in a poll-mode (described in the
following section); in this case interrupts can be disabled. Additionally, it may
be possible to selectively enable device interrupts in cfg_core_enter(). However,
care should be taken if any of the interrupts require C level context for interrupt
processing. The DebugCore continues to use the register stack in place at the time
the DebugCore was entered. The DebugCore will not need to lower the stack support
registers, but any C level interrupt handler may make temporary use of the stack
(this is very much operating system dependent). Further, it is important that no
attempt is made to reenter the DebugCore via, say, a memory access error during an
interrupt service routine which interrupted the DebugCore operation.

Breakpoints located at both physical and virtual addresses are supported if the
processor has on-chip breakpoint control registers. Without breakpoint registers,
breakpoints are always located at physical addresses. However, per-process
breakpoints are supported even if the processor has no on-chip MMU support, or if
the MMU is not in use because separate processes are each running in Supervisor
mode. Breakpoint capabilities are presented in detail in section D.3.

7.3.6  The  Message  System 

After the message system has been initialized with a call to msg_init(), the
DebugCore responds to MonTIP host messages appropriately and sends acknowledge
messages to the host containing any requested data. The operating system can also
make use of the message system to support application services such as access to
the file system on the TIP host machine. The msg_send() function is used to request
that a single message be sent. A similar function is made available by the message
system module on the TIP host processor.


int msg_send(struct message *msg_pointer);

The function returns 0 if the message was accepted for sending and -1 if the
message system is currently too busy. Variable msg_sbuf_p is maintained by MSG to
point to the message buffer currently being sent. When this variable becomes 0, the
message system is ready to send another message. The message buffer pointer passed
to msg_send() is copied into msg_sbuf_p; the contents of the buffer are not copied.
Thus the user must be careful not to modify the buffer data until the message has
been completely sent.
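The buffer ownership rule can be illustrated with a small host-side model. None
of this is MSG module source: model_msg_send, model_sbuf_p, model_send_done and
buffer_may_be_modified are hypothetical stand-ins, and the two-field message
header follows the message format described in section 7.3.7.

```c
#include <stddef.h>
#include <stdint.h>

/* Message header as described in the text: number, then length. */
struct message {
    uint32_t number;
    uint32_t length;
};

/* Host-side stand-in for msg_sbuf_p: non-NULL while a send is active. */
static struct message *model_sbuf_p = NULL;

/* Stand-in for msg_send(): 0 if accepted, -1 if a send is in progress. */
int model_msg_send(struct message *msg_pointer)
{
    if (model_sbuf_p != NULL)
        return -1;
    model_sbuf_p = msg_pointer;   /* only the pointer is kept */
    return 0;
}

/* Stand-in for the driver finishing transmission. */
void model_send_done(void)
{
    model_sbuf_p = NULL;
}

/* The caller may modify a buffer only when no send is using it. */
int buffer_may_be_modified(const struct message *m)
{
    return model_sbuf_p != m;
}
```

Because only the pointer is recorded, a caller that wants to queue a second
message while the first is still being sent needs a second buffer and must retry
the send once the first completes.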

Messages are received by asking the message system to poll the message driver
hardware until a message is available. Function msg_wait_for() is provided for this
task. Alternatively, the message system can interrupt the operating system or the
DebugCore when a message is received from the TIP host processor. Received messages
are normally located at address msg_rbuf. There is no danger of the receive buffer
being overwritten by a new incoming message, as MonTIP always expects to receive a
message before it will reply with a new message to the target.

7.3.7  MSG  Operation 

The MSG module may require the support of communications port specific driver
modules, most notably the SER module. This module contains the code necessary to
talk to serial communication UARTs which support target and MonTIP connection. The
MSG module also contains a number of shared memory communication drivers for IBM
PC-AT plug-in cards, such as the PCEB, EB29K, YARC and others.

Messages all have the same format: a 32-bit message number, then a 32-bit length
field, followed by any message related data. When the MSG determines that a new
message has been received, and its message number is greater than 64, the operating
system is interrupted (if interrupts are enabled), and execution continues at the
address given in the vector table for entry number V_OS_MSG (76). In OS-boot this
is address os_V_msg. This means that the operating system does not have to poll the
message interface for service request completion. Polling is required if the
message system is to continue operating with interrupts turned off. The message
system can be used to support HIF services (see the later OS-boot section).

Received messages with identification numbers less than 64 are intended for the
DebugCore. The MSG causes the DebugCore to be interrupted via vector table entry
V_DBG_MSG (75). This causes execution to continue at address dbg_V_msg. When
execution begins at this address, the processor state appears as if a hardware
interrupt has just occurred while executing User mode code or an operating system
service. The virtual interrupt mechanism is used to support this technique and is
described below.
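The routing decision made by the MSG module can be sketched as a function of the
message number field. This is an illustration rather than MSG source: the
function names are hypothetical, big-endian byte order for the 32-bit field is
assumed, and because the text gives rules only for numbers greater than and less
than 64, the sketch returns -1 for exactly 64 instead of guessing.

```c
#include <stdint.h>

#define V_DBG_MSG 75   /* vector numbers as given in the text */
#define V_OS_MSG  76

/* Read a 32-bit big-endian field (byte order is an assumption). */
static uint32_t get_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Select the virtual interrupt vector for a received message.
   Returns -1 for number 64, whose routing the text leaves open. */
int msg_select_vector(const uint8_t *msg)
{
    uint32_t number = get_u32(msg);   /* first field: message number */

    if (number > 64u)
        return V_OS_MSG;
    if (number < 64u)
        return V_DBG_MSG;
    return -1;
}
```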

7.3.8  MSG  Virtual  Interrupt  Mechanism 

Consider what happens when a UART receives a character and an interrupt is
generated:

1  The processor enters Freeze mode and execution continues in the UART serial
driver, at the address given in the vector table for the interrupt handler. (Note
that it is the operating system cold-start code’s responsibility to install the trap
handler for this interrupt, even if a MiniMON29K SER module driver is used.)

2  Next  the  SER  driver  saves  some  global  registers  to  memory. 

3  The  driver  talks  to  the  UART,  receives  the  character  and  places  the  new  data  into 
the  msg_rbuf  buffer  at  the  location  given  by  the  pointer  msg_next_p.  The 
registers  are  restored  and  the  pointer  incremented. 

4  The  SER  driver  then  jumps  (virtual  vectors)  to  address  msg_V_arrive  in  the 
MSG  module.  This  whole  procedure  appears  to  the  message  system  as  if  the 
interrupt  had  been  directed  to  msg_V_arrive  when  a  character  arrived  in  its 
buffer. 

5  The MSG saves its own working register space, examines the size of the
incoming message and decides if it is complete or if more data is required. If the
message is incomplete, the registers are restored and an IRET is issued. When it is
complete, the working registers are restored and the PC-buffer registers are updated
with the address of the operating system handler or DebugCore handler, obtained
from the vector table.
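The completeness test in step 5 can be sketched as follows. This is a
hypothetical illustration, not MSG source: it assumes that the length field
counts the data bytes which follow the 8-byte header and that fields are stored
big-endian; neither detail is stated in this section.

```c
#include <stddef.h>
#include <stdint.h>

#define MSG_HEADER_BYTES 8u   /* 32-bit number + 32-bit length */

/* Return 1 when the first `received` bytes of buf hold a complete
   message, 0 when more data is still required. */
int msg_is_complete(const uint8_t *buf, size_t received)
{
    uint32_t length;

    if (received < MSG_HEADER_BYTES)
        return 0;                     /* header not yet complete */

    /* The length field follows the message number (big-endian assumed). */
    length = ((uint32_t)buf[4] << 24) | ((uint32_t)buf[5] << 16) |
             ((uint32_t)buf[6] << 8)  |  (uint32_t)buf[7];

    return received - MSG_HEADER_BYTES >= length;
}
```

A receive handler built on this test would return with an IRET while the
function reports 0, and redirect the PC-buffer registers to the selected handler
once it reports 1.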

Using the sequence described above, messages arrive via a V_DBG_MSG or V_OS_MSG
virtual interrupt directly to the appropriate message processing handler. The
operating system and the DebugCore need never be concerned about any registers used
by the MSG or SER modules in the process of preparing the received message, as
their temporary register usage is kept hidden.

When interrupts are being used, rather than polling for a new message to arrive,
the msg_wait_for() function simply returns 0, indicating that no message is
available. If the SER module is making use of polling and interrupts are turned
off, then the msg_wait_for() function returns -1 when a complete new message is
available in msg_rbuf. In fact the MSG sets variable msg_rbuf_p to point to the
just-received message buffer. The DebugCore interrupt handler dereferences this
pointer when accessing any received messages.

7.4  THE  OS-BOOT  OPERATING  SYSTEM 

MiniMON29K is a debugger. It does not initialize the processor, service
interrupts, support HIF system calls or even install itself into the target system.
All these tasks must be performed by an operating system. The title may seem rather
grand, but OS-boot does perform these tasks. If a user does not build an operating
system or buy one from a third party, then OS-boot may be adequate for their project
needs.

AMD generally supplies OS-boot along with MiniMON29K for each 29K evaluation
system. Because OS-boot supports the HIF system call services, it is useful for
running evaluation software. However, OS-boot is a simple operating system; it does
not support multi-tasking or other grander operating system concepts. As well as
receiving MiniMON29K and OS-boot in EPROM, users get the source to OS-boot,
enabling them to make any necessary changes.

Typically, users will add operating system code to support additional peripheral
devices, or use OS-boot as a means of launching into another, more sophisticated
operating system. This is described in more detail later. The technique is useful
because it avoids the need to install MiniMON29K with the new operating system in
EPROM. The new operating system need merely be down-loaded via MiniMON29K debugger
messages into available target memory.

This section does not describe OS-boot in detail. It is mainly an overview of its
operation. Hopefully users will gain an understanding of its relevance in the debug
process.

7.4.1  Register  Usage 

According to the register usage convention, an operating system is free to use
global registers in the range gr64-gr95. OS-boot uses a good number of these
registers. Many of the floating-point instructions and some integer instructions are
not implemented directly by hardware with some members of the 29K family. This
requires that trapware be used to support the non-existing instructions. The
floating-point trapware included with OS-boot requires as many as 15 temporary
registers and three static registers. OS-boot is typically configured to assign
registers it0-kt11 (gr64-gr79) for temporary use and ks13-ks15 (gr93-gr95) for
static use.

The exact register assignment for OS-boot is determined by file register.s in the
osboot directory. Other than trapware support, registers are required for run-time
management and HIF services. These registers are typically allocated from the range
ks0-ks12 (gr80-gr92). There are a number of free registers for those who need to
add operating system support code.
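The correspondence between the symbolic register names and global register
numbers can be written down as a small lookup. The ks mapping (ks0 is gr80) and
the overall it0-kt11 range (gr64-gr79) follow from the ranges quoted above; the
split of that range into it0-it3 and kt0-kt11 is an assumption, and the function
and enumeration names are hypothetical. File register.s remains the authority
for an actual OS-boot build.

```c
/* Register classes named in the text. */
enum os_reg_class { REG_IT, REG_KT, REG_KS };

/*
 * Translate an OS register name such as kt11 or ks13 into its global
 * register number.  Returns -1 for an index outside the class.
 * The it/kt boundary (it0-it3, kt0-kt11) is an assumption.
 */
int os_reg_to_gr(enum os_reg_class cls, int n)
{
    switch (cls) {
    case REG_IT: return (n >= 0 && n <= 3)  ? 64 + n : -1;
    case REG_KT: return (n >= 0 && n <= 11) ? 68 + n : -1;
    case REG_KS: return (n >= 0 && n <= 15) ? 80 + n : -1;
    }
    return -1;
}
```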

7.4.2  OS-boot  Operation 

Operation begins at address label os_cold_start. The processor special
registers, such as CPS and CFG, are initialized to enable the processor start-up
sequence to commence. OS-boot does not contain very much cold-start code. However,
the code is complicated by the incorporation of options enabling any member of the
29K family to be dealt with.

The vector table entries are constructed. Most of the unused entries are set to
cause DebugCore entry. Thus, should any unexpected trap or interrupt happen, the
DebugCore will be able to report it. The vector table is normally placed at the
start of data memory.

The memory system is then analyzed in the process of building the data structure
passed to dbg_control(). In some cases this involves the operation of dynamic
memory sizing code. The floating-point trap handlers are then prepared for
operation. Initialization of floating-point support is a one-time operation, so it
occurs before dbg_control() is called.

Before the cold-start operation is complete, additional vector table entries are
made to support DebugCore operation, entries such as V_TRACE. The DebugCore/OS
shared data structure is then initialized and vector table entry 71 is set to point to
the base of the data structure. The message system is then initialized with a call to
msg_init() and dbg_control() is called, indicating the completion of operating system
cold-start code.

The return from dbg_control() causes execution of the operating system
warm-start code to commence at address warm_start. The run-time environment is
now prepared. Much of this is concerned with memory management. The memory
and register stack support registers are assigned values before any loaded application
code starts. The warm-start code examines the return parameters from dbg_control()
in preparing the run-time environment.

With 29K family members which have TLB hardware, OS-boot is normally
configured to start application code execution in User mode with address translation
turned on. Warm-start code gets the application code start address from return
register gr100. This address is loaded into the frozen PC-buffer registers and an IRET
is used to depart the operating system Supervisor mode code and enter the application
code in User mode. Register gr104 is used to select operating system warm-start
options. If bit 31 is set, then application code is started with no address translation
enabled. (To use this feature, set gr104 to -1 after using the MonDFE y command to
yank application code into target system memory.) Note that warm-start code does
not issue an IRET instruction directly; it jumps to the DebugCore service dbg_iret.
This enables the DebugCore to set the TE bit in the OPS register and so enable single
stepping of the first application code instruction. The BTE and BPID fields of any
breakpoint registers in use are also set by dbg_iret.
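
The warm-start option test can be pictured in C. This is only a sketch of the
bit test described above; the macro and function names are hypothetical and do not
come from the OS-boot sources, which perform the test in assembly language.

```c
#include <stdint.h>

/* Hypothetical name for the option described in the text: bit 31 of
 * gr104 asks for application start-up with address translation off. */
#define WARM_OPT_NO_TRANSLATION (UINT32_C(1) << 31)

/* Returns 1 when the warm-start options word requests that the
 * application be started without address translation, else 0. */
int warm_start_translation_off(uint32_t gr104_value)
{
    return (gr104_value & WARM_OPT_NO_TRANSLATION) != 0;
}
```

Setting gr104 to -1 sets every bit, including bit 31, which is why that value
selects the option.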

7.4.3  HIF  Services 

Once application code has started, operating system code will only be called
into play again when a floating-point trap occurs, a peripheral generates an interrupt,
or a HIF service is requested. HIF is a system call interface specification.
OS-boot supplies the necessary support code, which is accessed by a system call trap
instruction. Many of the library calls, such as printf(), result in HIF trapware being
called. HIF trapware support starts at address label HIFTrap.

HIF services are divided into two groups: those that can be satisfied by the 29K
itself (such as the sysalloc service), and those that need MonTIP support (such as
open). The HIF specification states that the service request number be placed in
register gr121; if this number is less than 256, then MonTIP must assist. A request for
MonTIP assistance to, say, open a file for writing is accomplished by the operating
system sending a MiniMON29K message to the TIP process. There are currently
three types of messages used by the OS: HIF-request, CHANNEL1 (used when
printing to stdout), and CHANNEL0_ACK (used when acknowledging data from
stdin). Note that it is easy to extend the operating system's use of the message system
and create new operating system message types. This may be useful if virtual memory
paging were supported by an operating system, with MonTIP acting as
the secondary memory controller.
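
The division of services by request number can be sketched in C as follows. Only
the threshold of 256 comes from the text; the type and function names are
hypothetical, and the real HIF trapware makes this decision in assembly language.

```c
#include <stdint.h>

/* Where a HIF request is satisfied (hypothetical names). */
enum hif_route { HIF_LOCAL, HIF_NEEDS_MONTIP };

/* Service numbers below this limit require MonTIP assistance. */
#define HIF_MONTIP_LIMIT 256u

/* Classify a service request number as it would arrive in gr121. */
enum hif_route hif_route_for(uint32_t service_number)
{
    return service_number < HIF_MONTIP_LIMIT ? HIF_NEEDS_MONTIP
                                             : HIF_LOCAL;
}
```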

MonTIP  replies  to  HIF  MiniMON29K  messages  by  sending  messages  to  the 
DebugCore  to  accomplish  the  requested  task.  It  then  sends  a  HIF_ACK  message  to 
the  operating  system  acknowledging  the  completion  of  the  requested  service. 

CHANNEL1 and CHANNEL0 messages are used by the operating system to
support display and keyboard data passing between the application program and the
user. Note that these are the only operating system messages which MonTIP passes
via UDI to the MonDFE process. MonTIP responds to stdout service requests with a
CHANNEL1_ACK message, and supplies new keyboard input characters with a
CHANNEL0 message sent to the operating system. (Note that some early versions of
MonTIP did not make use of the operating system *_ACK messages; they used the
DebugCore instead. This created difficulties for multitasking operating systems. If
you have this problem, you need to update your MonTIP program.)

Previously, the OS-boot implementation entered Wait mode after issuing a
MiniMON29K message. This was accomplished by setting the WM bit in the OPS
register before using an IRET to return to application code from the HIF trap handler.
Wait mode was exited when the message system interrupted the operating system in
response to a MonTIP reply message. Because Wait mode was used, OS-boot had to
run with interrupts turned on. However, the MiniMON29K DebugCore has no such
restriction and can operate in a poll-mode fashion. Recent versions of OS-boot can
also operate the message system in poll mode and need not have interrupts
permanently enabled. The latest OS-boot code no longer uses Wait mode while waiting
for a message system interrupt. Instead, either the message system or a flag variable is
continually polled; the flag is set by the message system interrupt handler, which
previously cleared the WM bit.
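
The flag-polling arrangement can be modelled in a few lines of C. This is a sketch
under stated assumptions, not the OS-boot code: all names are hypothetical, and the
poll_hook parameter stands in for whatever routine services the message hardware
when interrupts are off.

```c
#include <stdbool.h>

/* Set when a MonTIP reply has arrived (hypothetical flag). */
static volatile bool reply_ready = false;

/* Stand-in for the message system interrupt handler, which in the
 * older scheme cleared the WM bit and here simply sets the flag. */
void msg_interrupt_handler(void)
{
    reply_ready = true;
}

/* Poll until the reply flag is set. If poll_hook is non-null it is
 * called on every pass, modelling poll-mode operation of the message
 * system. Returns the number of passes made before the flag was seen. */
unsigned wait_for_reply(void (*poll_hook)(void))
{
    unsigned polls = 0;

    while (!reply_ready) {
        if (poll_hook)
            poll_hook();
        polls++;
    }
    reply_ready = false;   /* consume the reply */
    return polls;
}
```

With a null hook the loop relies on an interrupt to set the flag, which matches the
interrupt-driven case; with a hook the message system itself is polled.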

7.4.4  Adding  New  Device  Drivers 

OS-boot is a very simple operating system and it does not offer support for
additional I/O devices. However, the HIF specification states that file descriptors 0, 1
and 2 are assigned to standard in, standard out and standard error. Normally any open()
library calls issued by an application program will result in the HIF open service
returning a new file descriptor for a file maintained on the TIP host by MonTIP.

Target hardware often has additional UART or parallel port hardware
available for communication. If OS-boot is not completely replaced with a new OS,
then these devices should be accessed via the normal library/HIF interface. OS-boot
can be extended to include a driver to support any new peripheral device. Each device
should be pre-allocated a file descriptor value starting with number 3. All access to
peripherals can then be made through the pre-allocated file descriptors. If the
application code calls open(), then the HIF open service should initially return 4, or
some larger number depending on the number of peripheral devices added.

The Metaware libraries, supplied with the High C 29K compiler package,
pre-allocate buffer and MODE settings for file descriptors 0, 1 and 2. Assuming no access
to the library source file _iob.c, calls to open() should be placed inside the crt0.s
file. These open() calls should be made for each of the pre-allocated file descriptors
and will result in library initialization. The code inside crt0.o runs before the
application main() code. Note that the MODE value for the open() calls may be restricted
due to driver or peripheral limitations, and communication with the devices may be
required in RAW mode rather than any buffered mode supported by the library when a
device is opened in COOKED mode.

When library calls, or HIF calls such as _read() or _write(), are issued for the
file descriptor associated with a peripheral, the OS-boot trapware for the HIF services
shall call upon the required device driver to perform the requested task.
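
One way to picture the descriptor-based dispatch is the C sketch below. It assumes
a single added device (a UART) at descriptor 3. Every name is hypothetical, and the
return value of -1 is merely this sketch's way of saying that the request is not for
a local device and would be handled by the usual HIF path.

```c
#include <stddef.h>

#define FIRST_DEVICE_FD 3   /* 0, 1 and 2 are stdin, stdout, stderr */
#define NUM_DEVICES     1   /* assumption: one added peripheral     */

typedef long (*dev_write_fn)(const char *buf, size_t nbytes);

/* Hypothetical UART driver; it only counts the bytes handed to it. */
size_t uart_bytes_sent;

static long uart_write(const char *buf, size_t nbytes)
{
    (void)buf;
    uart_bytes_sent += nbytes;
    return (long)nbytes;
}

/* Driver table indexed by (descriptor - FIRST_DEVICE_FD). */
static const dev_write_fn device_write[NUM_DEVICES] = { uart_write };

/* Route a write request to a local driver when the descriptor was
 * pre-allocated to a peripheral; otherwise report -1. */
long hif_write_dispatch(int fd, const char *buf, size_t nbytes)
{
    int slot = fd - FIRST_DEVICE_FD;

    if (slot >= 0 && slot < NUM_DEVICES)
        return device_write[slot](buf, nbytes);
    return -1;
}

/* First descriptor the open service should hand out for host files. */
int first_open_descriptor(void)
{
    return FIRST_DEVICE_FD + NUM_DEVICES;
}
```

With one device installed, first_open_descriptor() yields 4, in line with the
numbering described earlier in this section.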

7.4.5  Memory  Access  Protection 

The OS-boot operating system includes an optional memory access protection
scheme which is useful with embedded system debugging. It only functions with
29K family members which contain TLB hardware. When used, the operating system
runs application programs in User mode with address translation turned on.
Thus, all application addresses are virtual, but the memory management hardware is
configured to map virtual to physical addresses with a one-to-one scheme. No
memory paging takes place and the entire program is at all times located in the
available target system memory.

The benefit of the system is that bad addresses, generated by unexpected program
execution, can be detected immediately. The operation of the 29K Translation
Look-aside Buffer (TLB) used to construct the management scheme was briefly
described in the previous section 7.2.2, entitled Memory Access Protection. This section
deals with the OS-boot code implementation. For more information about operation
of TLB hardware see Chapter 6 (Memory Management Unit).

First consider the typical OS-boot memory configuration shown in Figure 7-8.
Some 29K family members have a 3-bus architecture. This enables two memory
systems to be utilized, one for instruction memory and the second for data memory. If
instructions are to be accessed from data memory devices, or data placed in
instruction memory, then a bridge must be built between the data and instruction
memory buses. Note that a single address bus is shared by both memory systems.
Typically, designers will build a bridge enabling instruction memory to be accessed
from data memory address space. In such a case the two address spaces do not
overlap. However, without a bridge it is possible to have physical memory located in
the different address spaces but at the same address offset.

Most of the newer 29K family members have a conventional 2-bus architecture,
which results in instructions and data being located in the same memory devices
located in a single address space. OS-boot caters for all the different memory
configuration options, and this is reflected in the layout shown in Figure 7-8.

Figure 7-8. Typical OS-boot Memory Layout

Operating system warm-start code knows the address regions allocated to a
loaded application program by examining the data structure returned from the
dbg_control() call. OS-boot actually saves the data to memory locations for future
use, as we will see. Applications can be expected to access a limited number of
regions outside the data region loaded from the application COFF file. This is
required to support the memory allocation heap and the register and memory stacks.
The allocated access regions are shown shaded in Figure 7-8. An attempt to access an
address outside the allowed regions will cause the DebugCore to gain control of
program execution.

During normal code execution, instruction and data TLB misses will occur. This
requires that the TLB registers be refreshed with a valid address translation. OS-boot
trap handlers are used to perform this task. If a bad address is generated, the trap
handlers must detect it.

Two kinds of traps are expected: instruction TLB misses and data TLB misses.
The trap handler for instruction misses is shown below. The return values from
dbg_control(), shown in Figure 7-7, are stored by OS-boot in a structure at address
ret_struct. The PC1 value is compared with the start and end addresses of the loaded
program. If the PC1 address is within this range, then a new valid TLB entry is built
and program execution restarted. If the address is out of the allowed range, then a
jump to the DebugCore entry point, dbg_trap, is taken.

UITLBmiss:
        mfsr    it0, pc1              ;PC address
        const   it1, ret_struct
        consth  it1, ret_struct
        load    0, 0, it2, it1        ;TEXT start
        cpltu   it2, it0, it2         ;jump if
        jmpt    it2, UIinvalid        ;PC < start
        add     it1, it1, 4
        load    0, 0, it2, it1        ;TEXT end
        cpgtu   it2, it0, it2         ;jump if
        jmpt    it2, UIinvalid        ;PC > end
        const   it2, (VE|UE)

TLB register Word 0 has access control bits which separately enable Read,
Write and Execution of data for the addressed page. The example code assumes that
data and instructions are not located on the same page, as pages containing
instructions are marked for execution only.

one_to_one:                           ;it2 has RWE bits
        mfsr    it3, mmu              ;need page size
        srl     it3, it3, 8           ;get PS bits
        and     it3, it3, 3           ;1k page min
        add     it3, it3, 10 + 5      ;32-sets
        srl     it1, it0, it3         ;form
        sll     it1, it1, it3         ;VTAG
        or      it1, it1, it2         ;add RWE bits
        mfsr    it2, mmu              ;get PID
        and     it2, it2, 0xFF
        or      it1, it1, it2         ;add PID bits
        sub     it3, it3, 5           ;page size
        srl     it2, it0, it3         ;form
        sll     it2, it2, it3         ;RPN
        mfsr    it0, lru              ;select column
        mttlb   it0, it1              ;word 0
        add     it0, it0, 1
        mttlb   it0, it2              ;word 1
        iret

UIinvalid:
        jmp     dbg_trap              ;enter DebugCore
        const   gr64, 8
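
For readers more comfortable with C, the arithmetic performed by the instruction-miss
handler can be modelled as below. The model follows the shifts and masks of the
listing above; the function and structure names are hypothetical, and the bit
positions chosen for VE and UE are assumptions that should be checked against the
processor manual.

```c
#include <stdint.h>

/* Assumed positions of two TLB Word 0 access bits. */
#define TLB_VE (UINT32_C(1) << 14)   /* entry valid           */
#define TLB_UE (UINT32_C(1) << 8)    /* User-mode execute     */

struct tlb_entry {
    uint32_t word0;   /* VTAG, access bits and PID */
    uint32_t word1;   /* real page number          */
};

/* Range test made before an entry is built: start <= pc <= end. */
int pc_in_text(uint32_t pc, uint32_t text_start, uint32_t text_end)
{
    return pc >= text_start && pc <= text_end;
}

/* Build a one-to-one (virtual == physical) entry for addr. */
struct tlb_entry one_to_one(uint32_t addr, uint32_t mmu,
                            uint32_t access_bits)
{
    uint32_t ps         = (mmu >> 8) & 3;   /* page-size field          */
    uint32_t pid        = mmu & 0xFF;       /* process identifier       */
    uint32_t page_shift = ps + 10;          /* 1K-byte minimum page     */
    uint32_t tag_shift  = page_shift + 5;   /* 32 sets: 5 index bits    */
    struct tlb_entry e;

    e.word0 = ((addr >> tag_shift) << tag_shift) | access_bits | pid;
    e.word1 = (addr >> page_shift) << page_shift;
    return e;
}
```

The model stops where the listing reads the LRU register; selecting the TLB column
and writing the two words with mttlb have no portable C equivalent.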


The data-miss trap handler is a little more complicated. The address under
consideration appears in channel register CHA. The address is first tested to see if it is
at or above the data region start address and below the current heap pointer. The
heap, which is maintained by the operating system, was initialized just above the end
of the loaded program data region. If the address is not within this range, then it is
tested to determine if it is within the memory or register stack regions. The stacks are
located at the very top of physical data memory.


UDTLBmiss:
        mfsr    it0, cha              ;data address
        const   it1, ret_struct+8
        consth  it1, ret_struct+8
        load    0, 0, it2, it1        ;DATA start
        cpltu   it2, it0, it2         ;jump if
        jmpt    it2, UDinvalid        ;adds < start
        cpltu   it2, it0, heapptr     ;adds < heapptr
        jmpt    it2, one_to_one
        const   it2, (VE|UR|UW)
stacks:
        const   it2, HIGHMEM
        consth  it2, HIGHMEM
        load    0, 0, it2, it2        ;DATA end
        cpgeu   it2, it0, it2         ;jump if
        jmpt    it2, UDinvalid        ;adds >= end
        cpgeu   it2, it0, msp         ;jump if
        jmpt    it2, one_to_one       ;adds >= msp
        const   it2, (VE|UR|UW)
UDinvalid:
        jmp     dbg_trap              ;enter DebugCore
        const   gr64, 9

The example trap handler marks data pages for read and write access only. If the
CHA address does not fall within the allowed region, then a TLB entry is not built
and program execution is not normally restarted. Instead, the DebugCore is entered
and the trap number passed.
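
The sequence of comparisons made by the data-miss handler can be restated in C.
The sketch below mirrors the order of tests in the listing; the structure and its
field names are hypothetical stand-ins for the values the handler reads from
ret_struct, heapptr, HIGHMEM and msp.

```c
#include <stdint.h>

/* Region limits consulted by the data-miss handler (hypothetical). */
struct data_regions {
    uint32_t data_start;   /* start of loaded program data      */
    uint32_t heap_ptr;     /* current top of the heap           */
    uint32_t msp;          /* current memory stack pointer      */
    uint32_t data_end;     /* end of physical data memory       */
};

/* Returns 1 when a TLB entry should be built for addr, 0 when the
 * DebugCore should be entered instead. */
int data_address_allowed(uint32_t addr, const struct data_regions *r)
{
    if (addr < r->data_start)
        return 0;               /* below the data region      */
    if (addr < r->heap_ptr)
        return 1;               /* program data or heap       */
    if (addr >= r->data_end)
        return 0;               /* beyond data memory         */
    return addr >= r->msp;      /* stack region at the top    */
}
```

An address that lies between the heap pointer and the stack pointer fails every
test and is reported as invalid, which is how a runaway pointer is caught.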

7.4.6  Down  Loading  a  New  OS 

One way to replace OS-boot with another operating system is simply to link the
new operating system with the MiniMON29K modules and, if necessary, place the
result in EPROM memory. Alternatively, the linked image can be down-loaded to the
target 29K system using MiniMON29K messages. However, many users like to keep the
existing OS-boot/MiniMON29K combination in place and down-load only the new
operating system (or a portion of it); this can create complications. Assuming no
changes are made to the supplied OS-boot, then, when execution of the loaded OS is
started, with, say, a MonDFE ‘g’ command, warm-start code will prepare for execution
to begin at the first instruction of the new operating system. Generally a HIF settrap
service call is made, followed by an assertion of the assigned trap number. This
allows Supervisor mode to be entered.

The new operating system must initially run in Supervisor mode to take over
processor resources initially under OS-boot control. If the floating-point trap handlers
are to remain installed, then the new operating system must be careful to remember
their global register support requirement. If the new operating system is still
supporting HIF services, then it must also pay attention to the HIF trapware register
usage. HIF traps will occur if any application code run by the new OS is linked with
libraries intended for use with a HIF conforming operating system. However, often a
new operating system will replace the HIF libraries with new libraries which do not
call HIF, but make use of the system call services of the new operating system.

The HIF trapware code can be replaced with new code, whose register usage is
better integrated with the new operating system, by the new system taking over the
HIF vector table entry. If this is done, then it is likely that the operating system
message interrupt handler will also be taken over. Unless the os_V_msg trap handler
address is replaced, the message system will continue to call the OS-boot interrupt
handler, and the associated operating system register usage should be taken into
account. Alternatively, AMD supplies driver routines which make it easy for a new
operating system to use the original message system for standard input and output
communication. This eliminates the need for the new OS to take over the message
system interrupt handlers.

The MiniMON29K message system is typically supported by low level driver
code which is often interrupt driven. Most often this is a UART interrupt handler. The
message system will not generate virtual interrupts if the low level handler vector
table entry is taken over. Taking over the entry can be necessary because of interrupt
overloading. For example, the Am29200 interrupt INTR3 is used for all peripheral
devices including the on-chip UART. A new operating system may wish to add support
for, say, DMA activity, which was not supported by OS-boot. This may require an
interrupt handler activated by INTR3. If the MiniMON29K message system is to
continue operation, then the new operating system must take over the INTR3 vector
table entry. But, after the new operating system handler is complete, it must jump to
the original vector handler address rather than IRETing. This gives the message system
low level interrupt handler an opportunity to run. A better alternative is to use the
technique described in section 2.5.5 to deal with INTR3 overloading.
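
The chaining arrangement can be illustrated with a small C model. It is a sketch
only: the one-entry vector table and all handler names are hypothetical, and a C
function call stands in for the jump that real trapware would make to the saved
handler address.

```c
#include <stddef.h>

typedef void (*handler_fn)(void);

#define INTR3_VECTOR 0                 /* index within this toy table */
static handler_fn vector_table[1];
static handler_fn original_intr3;      /* saved OS-boot handler       */

int uart_serviced;                     /* counts for demonstration    */
int dma_serviced;

/* Stand-in for the message system's low level UART handler. */
static void msg_uart_handler(void)
{
    uart_serviced++;
}

/* New operating system handler: service DMA, then pass control on
 * to the original handler instead of returning from the interrupt. */
static void new_os_intr3(void)
{
    dma_serviced++;
    if (original_intr3)
        original_intr3();
}

void install_msg_handler(void)
{
    vector_table[INTR3_VECTOR] = msg_uart_handler;
}

/* Remember the current entry before overwriting it. */
void take_over_intr3(void)
{
    original_intr3 = vector_table[INTR3_VECTOR];
    vector_table[INTR3_VECTOR] = new_os_intr3;
}

void raise_intr3(void)
{
    vector_table[INTR3_VECTOR]();
}
```

After take_over_intr3() a single INTR3 event runs both handlers, so the message
system continues to see its UART interrupts.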

7.5  UNIVERSAL  DEBUG  INTERFACE  (UDI) 

Code development for an embedded processor is generally more costly than
development of code of equivalent complexity intended for execution on an engineering
workstation. The embedded application code cannot benefit from an underlying
support operating system such as UNIX. In some cases, developers may choose to first
install a small debug support monitor, such as MiniMON29K, or a third-party
executive which can offer a somewhat improved development environment. In the process
of getting an embedded support monitor running or developing application code to
run directly on the processor, emulation hardware may be employed. The availability
of debug tools and their configurability is an important factor when selecting a
processor for an embedded project.

The architecture of the latest RISC processors may be simplified compared to
their CISC predecessors, but the complexity of controlling the processor operation
has not been reduced. The use of register stacks, instruction delay slots and other
performance enhancing techniques has led to increased use of high level programming
languages such as C. The compiler has been given the responsibility of producing
efficient assembly code, and the developer rarely deals with code which manipulates
data at the processor register level. The increased productivity achievable by
this approach is dependent on high level debug support tools.

Developers of products containing embedded processors are looking to RISC
for future products offering increased capability. The greater performance relative to
RISC processor cost should make this possible. The suitability, cost and productivity
of the tools available for code development are likely to be the major factor in
deciding the direction ahead in preparing to tool-up for RISC.

The following sections describe the Universal Debug Interface (UDI), which is
processor independent and enables greater debug tool configurability. A number of
emulator and embedded monitor suppliers, as well as high level language debug
tools suppliers, are currently configuring their tools to comply with the proposed
UDI standard. Current implementations are targeted for RISC processor code
development. UDI should ease the choice in selecting tools and, consequently, selecting
RISC. This section shall concentrate on describing the Free Software Foundation’s
GDB C language source debugger’s integration with UDI.

7.5.1 Debug Tool Developers

A debug tool developer typically arranges for their product to be available for a
range of popular processors. This normally means rebuilding the tool with the
knowledge required to understand the peculiarities of each processor. If an enhancement is
made to the debugger user-interface, then normally the debugger source and the
processor specific information must be recompiled and tested before customers are
updated.

When developing code to run on an engineering workstation, the processor
supporting the debugger execution is the same processor running the program being
developed. This means the debugger can make use of operating system services, such as
ptrace() (see section 7.5.3), to examine and control the program being debugged.
When developing code for an embedded application, the program being developed is
known as the Target Program and executes on the Target Processor, which is usually a
different processor from the one supporting the debugger, known as the Host
Processor. The host processor and target processor do not communicate via the ptrace()
system call, but via whatever hardware communication path links the two processors.
The portion of the debugger which controls communication with the target processor
is known as the target interface module. Whenever a change or addition is required
in the communications mechanism, the debugger must once again be recompiled to
produce a binary executable which is specific to the target-processor and
target-communications requirements.

When the chipmakers turn out their latest whiz-bang RISC processor, the tool
developer companies are faced with considerable development costs in ensuring
their tools function with the new architecture. It is not uncommon for the availability
of debug tools to lag behind RISC chip introduction. Often tools are introduced with
limited configuration options. For example, target processor communication may be
according to a low level debug monitor protocol, or an in-circuit emulator (ICE)
protocol. Each debugger product has its own target interface module; this module must
be developed for each debugger in order to communicate with the new target RISC
processor.

An embedded application developer may have prior experience with or a preference
for a particular debug tool, but the only available communications path to the target
may not be currently supported. This incompatibility may discourage the developer
from choosing to use a new processor. It is desirable that debuggers share
communication modules and be more adaptable to available target processor interfaces.

Ideally a debugger from one company should be able to operate with, say, an
emulator from another company. This would make it possible for a customer to select
a little-used debugger with a popular target monitor or vice versa.

The goal of the Universal Debug Interface (UDI) is to provide a standard interface
between the debugger developer and the target communications module, so the
two can be developed and supplied separately. In fact, an applications developer
could construct their own communications module, for some special hardware
communications link, as long as it complied with the standard.

7.5.2  UDI  Specification 

If UDI were a specification at procedural level, then debugger developers and
communication module developers would have to supply linkable images of their
code so the debug tool combination could be linked by the intended user. This is
undesirable because it would require a linked image for every tool combination.
Additionally, the final linked program would be required to run on a single debug host.
UDI actually relies on an interprocess communication (IPC) mechanism to connect
two different processes. The debugger is linked into an executable program to run on
the host processor; this process is known as the Debugger Front End (DFE). The
communications module is linked as a separate process which runs on the same or a
different host processor; this process is known as the Target Interface Process (TIP). The
two processes communicate via the UDI interprocess communication specification.

Two  IPC  mechanisms  have  so  far  been  specified:  one  uses  shared  memory  and  is 
intended  for  DOS  developers,  the  second  uses  sockets  and  is  intended  for  UNIX  and 
VMS  developers.  Of  course,  when  the  shared  memory  IPC  implementation  is  used 
the  DFE  and  TIP  processes  must  both  execute  on  the  same  host  processor.  Using 
sockets  with  Internet  domain  communication  enables  the  DFE  and  TIP  to  each 
execute  on  separate  hosts  on  a  computer  network.  Thus  an  applications  developer 
can,  from  the  workstation  on  his  desk,  debug  a  target  processor  which  is  connected  to 
a  network  node  located  in  a  remote  hardware  lab.  Using  sockets  with  UNIX  domain 
addresses  (the  method  used  to  implement  UNIX  pipes)  enables  both  processes  to  run 
on  the  same  host. 

Some of the currently available UDI conforming debug tools are presented in
Figure 7-9. The interprocess communications layer defined by UDI enables the
applications developer to select any front end tool (DFE) with any of the target control
tools (TIP).

Because developers of UDI conforming tools must each have code which interfaces
with the IPC mechanism according to the UDI protocol, the UDI community
freely shares a library of code known as the UDI-p library. This code presents a
procedural layer which hides the IPC implementation. For example, consider the
UDIRead procedure.

The DFE code calls the UDIRead function, which transports the function call to
the TIP process. The TIP code developer must resolve the function request by adding
code which is specific to controlling the particular target. The IPC layer is effectively
transparent; the TIP developer is unaware that the procedure caller is from a different
process, possibly on a different host machine. Table 7-2 lists most of the UDI-p
procedures available.

[Figure labels: Debugger Front-end Process; UDI-IPC Layer; Remote-target Interface Process]

Figure 7-9. Currently Available Debugging Tools that Conform to UDI Specification
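
The idea of a procedural layer that hides the IPC mechanism can be sketched in C.
The example below is not the UDI-p library and does not reproduce its function
signatures; every name and the request layout are hypothetical. A memory copy
stands in for the shared memory or socket transport, so the whole sketch runs
inside one process.

```c
#include <stdint.h>
#include <string.h>

/* Toy target memory owned by the TIP side of the sketch. */
static uint8_t target_memory[64];

/* Request carried across the IPC channel (hypothetical layout). */
struct read_request {
    uint32_t from;    /* target address to read from */
    uint32_t count;   /* number of bytes wanted      */
};

/* TIP side: resolve the request with target-specific code. */
static int tip_handle_read(const struct read_request *req, uint8_t *reply)
{
    if (req->from > sizeof target_memory ||
        req->count > sizeof target_memory - req->from)
        return -1;                      /* outside target memory */
    memcpy(reply, &target_memory[req->from], req->count);
    return 0;
}

/* Stand-in for the transport: the request is copied as raw bytes,
 * as if it had crossed a process boundary, then handed to the TIP. */
static int ipc_round_trip(const void *msg, size_t len, uint8_t *reply)
{
    struct read_request received;

    if (len != sizeof received)
        return -1;
    memcpy(&received, msg, sizeof received);
    return tip_handle_read(&received, reply);
}

/* DFE side: an ordinary function call; the caller never sees the
 * transport underneath. Returns 0 on success, -1 on failure. */
int sketch_read(uint32_t from, void *to, uint32_t count)
{
    struct read_request req = { from, count };

    return ipc_round_trip(&req, sizeof req, (uint8_t *)to);
}

/* Helper used only to place known data in the toy target memory. */
void sketch_poke(uint32_t addr, uint8_t value)
{
    target_memory[addr % sizeof target_memory] = value;
}
```

Replacing ipc_round_trip() with code that writes to a socket would leave both
sketch_read() and tip_handle_read() unchanged, which is the property the
procedural layer is meant to provide.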


Because the DFE and TIP processes may be running on different machines, care
must be taken when moving data objects between hosts. An “int” sized object on the
DFE supporting machine may be a different size from an “int” on the TIP supporting
machine. Further, the machines may differ in endianness. The UDI-p procedures
make use of a machine independent data description technique similar to the XDR
library available with UNIX. Data is converted into a universal data representation
(UDR) format before being transferred via sockets. On being received, the data is
converted from UDR format into data structures which are appropriate for the
receiving machine. The UDI-p procedures keep the UDR activity hidden from the UDI
user.

Table 7-2. UDI-p Procedures (Version 1.2)

Procedure                  Operation

UDIConnect                 Connect to selected TIP
UDIDisconnect              Disconnect from TIP
UDISetCurrentConnection    For multiple TIP selection
UDICapabilities            Obtain DFE and TIP capability information
UDIEnumerateTIPs           List multiple TIPs available
UDICreateProcess           Load a program for debugging
UDISetCurrentProcess       Select from multiple loaded programs
UDIDestroyProcess          Discontinue program debugging
UDIInitializeProcess       Prepare runtime environment
UDIRead                    Read data from target-processor memory
UDIWrite                   Write data to target-processor memory
UDICopy                    Duplicate a block of data in target memory
UDIExecute                 Start/continue target-processor execution
UDIStep                    Execute the next instruction
UDIStop                    Request the target to stop execution
UDIWait                    Inquire about target status
UDISetBreakpoint           Insert a breakpoint
UDIQueryBreakpoint         Inquire about a breakpoint
UDIClearBreakpoint         Remove a breakpoint


In later sections of this chapter, the development of a UDI conforming GDB, a
source level debugger from the Free Software Foundation and Cygnus Support, is
discussed in more detail. GDB is an example of a DFE process. As an example of a
TIP process, we shall look at the MiniMON29K monitor and the Instruction Set
Simulator from AMD. Most users of GDB will have some knowledge of the ptrace()
system call which enables GDB to examine the state of the process being debugged.
A brief description of ptrace() is beneficial, along with further explanation of its
unsuitability for embedded application software development.

7.5.3  P-trace 

The UNIX system call, ptrace(), provides a means by which a process may control
the execution of another process executing on the same processor. The process being
debugged is said to be “traced”. However, this does not mean that the execution path
of a process is recorded in a “trace buffer” as is the case with many processor
emulators. Debugging with ptrace() relies on the use of instruction breakpoints and
other hardware or processor generated signals causing execution to stop.

ptrace(request, pid, addr, data)

There are four arguments, whose interpretation depends on the request argument.
Generally, pid is the process ID of the traced process. A process being debugged
behaves normally until it encounters some signal, whether internally (processor)
generated, like illegal instruction, or externally generated, like interrupt. Then
the traced process enters a stopped state and the tracing process is notified using the
wait() system call. When the traced process is in the stopped state, its core image can
be examined and modified using the ptrace() service. If desired, another ptrace()
request can then cause the traced process either to terminate or to continue. Table 7-3
lists the ptrace() request services available.

Table 7-3. ptrace() Services

Request       Operation

TraceMe       Declare that the process is being traced
PeekText      Read one word in the process’s instruction space
PeekData      Read one word in the process’s data space
PeekUser      Examine the process-control data structure
PokeText      Write one word in process’s text space
PokeData      Write one word in process’s data space
PokeUser      Write one word in process-control data structure
Cont          Startup process execution
Kill          Terminate the process being debugged
SingleStep    Execute the next instruction
GetRegs       Read processor register
SetRegs       Write processor register
ReadText      Read data from process’s instruction space
ReadData      Read data from process’s data space
WriteText     Write data into process’s instruction space
WriteData     Write data into process’s data space
SysCall       Continue execution until system call

Because, in an embedded system, the process with the user-interface controlling
the debugging and the application process being debugged generally do not execute
on the same processor, it is not possible to use the ptrace() system call mechanism to
debug embedded application software. The debugger process (DFE) must run on a
separate processor and communicate with the processor supporting execution of the
application code.

The Free Software Foundation’s source level debugger, GDB, makes use of the
ptrace() system call. However, it can alternatively use a collection of procedures
which support communication to a remote processor. These procedures implement
the necessary protocols to control the hardware connecting the remote processor to
the “host” debug processor. By this means, GDB can be used to debug embedded
application software running on application specific hardware. The following section
discusses the method in more detail.

7.5.4  The  GDB-UDI  Connection 

GDB can, in place of ptrace(), make use of a procedural interface which allows
communication with a remote target processor. Newer versions of GDB (version
3.98 and later) achieve this via procedure pointers which are members of a target_ops
structure. The procedures currently available are listed in Table 7-4. According to
GDB configuration convention, the file remote-udi.c must be used to implement the
remote interface procedures. In the case of interfacing to the IPC mechanism used by
UDI, the procedures in Table 7-4 are mapped into the UDI-p procedures given in
Table 7-2. With the availability of the UDI-p library, it is a simple task to map the
GDB remote interface procedures for socket communication with a remote target
processor.

Table 7-4. GDB Remote-Target Operations

Function                  Operation

to_open()                 Open communication connection to remote target
to_close()                Close connection to remote target
to_attach()               Attach to a loaded and running program
to_detach()               Detach for multitarget debugging
to_start()                Load program into target-system memory
to_wait()                 Wait until target-system execution stops
to_resume()               Startup/Continue target-system execution
to_fetch_register()       Read target-system processor register(s)
to_store_register()       Write register(s) in target-system processor
to_xfer_memory()          Read/Write data to target-system memory
to_insert_breakpoint()    Establish an instruction break address
to_remove_breakpoint()    Remove a breakpoint
to_load()                 Load a program into target-processor memory
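
The procedure-pointer technique can be illustrated with a much reduced operations vector. The sketch below is not GDB's actual target_ops declaration: the structure name demo_target_ops, the argument lists, and the array standing in for target memory are all assumptions made for the example. The comments indicate which UDI-p procedure from Table 7-2 a real remote-udi.c would presumably call at each point.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative, cut-down operations vector; member names follow Table 7-4
 * but the signatures are assumptions, not GDB's real declarations. */
struct demo_target_ops {
    int  (*to_open)(const char *session_id);
    void (*to_close)(void);
    int  (*to_xfer_memory)(unsigned long addr, unsigned char *buf,
                           size_t len, int do_write);
};

unsigned char demo_target_mem[64];   /* stands in for 29K target memory */
int demo_connected = 0;

int demo_udi_open(const char *session_id)
{
    /* A real implementation would call UDIConnect here. */
    demo_connected = (session_id != NULL && session_id[0] != '\0');
    return demo_connected ? 0 : -1;
}

void demo_udi_close(void)
{
    demo_connected = 0;              /* would call UDIDisconnect */
}

int demo_udi_xfer_memory(unsigned long addr, unsigned char *buf,
                         size_t len, int do_write)
{
    if (!demo_connected || addr > sizeof demo_target_mem
        || len > sizeof demo_target_mem - addr)
        return -1;
    if (do_write)
        memcpy(demo_target_mem + addr, buf, len);   /* would call UDIWrite */
    else
        memcpy(buf, demo_target_mem + addr, len);   /* would call UDIRead */
    return (int)len;
}

/* The debugger core only ever calls through this table. */
struct demo_target_ops demo_udi_ops = {
    demo_udi_open, demo_udi_close, demo_udi_xfer_memory
};
```

Because the debugger calls only through the table, a different target type can be supported by supplying a different set of procedures without altering the calling code.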


7.5.5  The  UDI-MiniMON29K  Monitor  Connection,  MonTIP 

MiniMON29K monitor code cannot function without the support of a software
module located in a support processor; the software module is known as the target
interface process (TIP). The 29K target processor communicates with the processor
running the TIP process via a serial link or other higher performance channel. This
link supports a message system which is private to the MiniMON29K monitor; that
is, it is completely independent of the UDI protocol. (See Figure 7-2.)

MiniMON29K must be installed in target system ROM memory or downloaded
by the TIP host via a shared memory interface. The target application code,
and additional operating system code, can then be downloaded via the message
system. If changes to the code are required, then the message system can be used to
quickly download new code without changing any ROM devices.

The MiniMON29K TIP process, montip, converts UDI service requests into
MiniMON29K messages. The montip program, when run on UNIX machines,
typically communicates with the target using an rs232 link. When run on DOS
machines, it may communicate using an rs232 connection or a PC plug-in board shared
memory scheme. Note that UNIX machines can also be used to debug PC plug-in cards;
the pcserver program, run on DOS machines, enables the PC serial port to be
connected to a UNIX machine. The MiniMON29K messages, transferred to the DOS
host via plug-in card shared memory, are sent to the TIP host via the rs232
connection. The montip program supports several command-line options, as shown below.
Not all are applicable to both DOS and UNIX host machines.

montip -t target [-r OS-boot] [-m msg_log] [-com serial_port]
       [-re msg_retries] [-mbuf msg_bufsize] [-bl msg_loopcount]
       [-to timeout] [-seg PC_seg_addr] [-port PC_port_base]
       [-baud baudrate] [-le] [-R|P]

An explanation of the command line options can be obtained by just entering
montip on your TIP host machine. When the montip process is started it advertises
its readiness to service UDI requests. A DFE process will typically connect to the
TIP process and a debug session will commence. Alternatively, there is no need to
first start the TIP process. When a DFE process is started, such as mondfe, it will look
for the advertised TIP; if the TIP process is not found the DFE will automatically start
the TIP. This is how montip is normally started. The start-up montip parameters are
taken from the “UDI Configuration File”. The format of this file is explained in the
following section discussing mondfe.

7.5.6  The  MiniMON29K  User-Interface,  MonDFE 

The MiniMON29K DFE process, mondfe, is a primitive 29K debugger. It
provides a basic user-interface for the MiniMON29K product. It is fully UDI compliant
(at least UDI version 1.2), and it can be used with any of the available TIP processes
such as isstip, mtip and montip. It is very easy to operate but has less debugging
capability than other DFEs, such as gdb, xray29u or UDB (see section 7.7);
for example, it does not support symbolic debugging.

It is very useful for simply loading application programs and starting their
execution where no debugging support is required. Its simple command set also
makes it easy to learn; when running, simply type the h command to obtain a
complete list of available commands. The h command can also be used to explain each
command’s operation; for example, “h s” will explain the operation of the set
command. Several command-line options are supported.

mondfe [-D] -TIP tip_id [-q] [-e echo_file] [-c command_file]
       [-ms mem_stack_size] [-rs reg_stack_size] [-le]
       [-log logfile] [pgm_name [arg_list]]

A list of command line options can be had by entering mondfe on your DFE host
processor. The process is typically started by entering a command such as “mondfe
-D -TIP serial”. The “-D” option causes an interactive debug session to commence.
The UDI conforming TIP process communicating with mondfe is identified by the
“-TIP serial” command line option.

DFEs and TIPs establish communication via a UDI Configuration File. On
UNIX machines this file is called udi_soc; on DOS machines it is called udiconfs.txt.
The Configuration File is found by first looking in the current working directory. If
not found, the file given by environment variable UDICONF is searched for. Lastly,
the executable PATH is searched. The format of these files is very similar; on UNIX:


session_id  AF_UNIX  socket_name  tip_exe  tip_parameters 

session_id  AF_INET  host_name  port  <not  required> 


The first column gives the session_id, which is used to select the appropriate
line. The serial key-word used with the “-TIP” option in the example is compared
with the session_id for each line in the Configuration File. The first matching line
provides all the necessary data for connecting to a TIP process which is already
running; or, if necessary, starting TIP process execution.
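
A minimal sketch of how a DFE might split a udi_soc line into its fields and compare the first field with the requested session_id is given below. The structure udi_entry, the function names and the field widths are hypothetical; the UDI-p library may well do this differently.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one UNIX Configuration File line. */
struct udi_entry {
    char session_id[32];
    char domain[16];     /* "AF_UNIX" or "AF_INET" */
    char address[64];    /* socket name, or host name for AF_INET */
    char rest[256];      /* TIP executable and parameters, or port number */
};

/* Split one line into fields. Returns 0 on success, -1 if the line has
 * fewer than three whitespace-separated fields. */
int parse_udi_line(const char *line, struct udi_entry *e)
{
    int consumed = 0;

    e->rest[0] = '\0';
    if (sscanf(line, "%31s %15s %63s %n",
               e->session_id, e->domain, e->address, &consumed) < 3)
        return -1;
    if (consumed > 0) {
        strncpy(e->rest, line + consumed, sizeof e->rest - 1);
        e->rest[sizeof e->rest - 1] = '\0';
        e->rest[strcspn(e->rest, "\r\n")] = '\0';   /* drop line ending */
    }
    return 0;
}

/* Returns 1 when the line parses and its first column equals session_id. */
int udi_line_matches(const char *line, const char *session_id,
                     struct udi_entry *e)
{
    return parse_udi_line(line, e) == 0
        && strcmp(e->session_id, session_id) == 0;
}
```

A caller would read the file line by line and stop at the first line for which udi_line_matches() returns 1.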

The second column gives the socket domain used by the socket IPC mechanism
connecting the two processes. Two domains are supported. The AF_UNIX domain
indicates both processes reside on the same host processor. Use of the AF_INET
domain indicates the TIP process is on another networked host machine. In such a case,
the host name and socket port number are supplied in the following columns. The
UDI specification does not support DFEs starting TIP processes on remote hosts.
When the AF_INET domain is used to connect to the TIP, the TIP process must
already be up and running before connection is attempted.

When the AF_UNIX domain is used, the third column gives the name of the
socket used by the TIP to advertise its UDI services. If the DFE is unable to connect to
the named socket, it will assume the TIP is not running. In such a case the remaining
line information gives the name of the TIP executable and the start-up parameters.
Below is an example of udi_soc file contents.


mon       AF_UNIX  mon_soc  montip -t serial -baud 38400 -com /dev/ttya
serial    AF_UNIX  *        montip -t serial -baud 9600 -com /dev/ttya
iss       AF_UNIX  iss_soc  iss -r ../../src/osboot/sim/osboot
pcserver  AF_UNIX  pc_soc   pcserver -t serial -baud 9600 -com /dev/ttya
cruncher  AF_INET  hotbox   7000
netrom    AF_UNIX  net_soc  montip -t netrom -netaddr 163.181.22.41 ...


The relative path names given with montip start-up parameters are relative to
<montip executable directory>/../lib. The path given with the “-r” option is required
to find the OS-boot code for 29K start-up. When the DFE is always used to
automatically start the TIP process, a “*” can be used for the socket name field. This
causes the DFE to generate a random name for the socket file. This file will be removed
when the DFE and TIP discontinue execution at the end of the debug session.
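
For the AF_UNIX case, the test a DFE makes to decide whether a TIP is already running can be sketched with the standard BSD socket calls. The function name connect_unix_tip is hypothetical and is not part of the UDI-p library; a return value of -1 corresponds to the DFE concluding that the TIP is not running, at which point it would start the TIP executable named on the Configuration File line.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Try to connect to a TIP advertising on the named AF_UNIX socket.
 * Returns a connected descriptor, or -1 if nothing is listening. */
int connect_unix_tip(const char *socket_name)
{
    struct sockaddr_un addr;
    int fd;

    if (strlen(socket_name) >= sizeof addr.sun_path)
        return -1;                      /* name would not fit */
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, socket_name);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);                      /* no TIP is listening on this name */
        return -1;
    }
    return fd;
}
```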

The  DOS  Configuration  File  ( udiconfs.txt )  format  is  a  little  simpler.  There  are 
only  three  entry  fields,  as  shown  by  the  example  below: 

mon       montip.exe  -t serial -baud 38400
serial    montip.exe  -t serial -baud 9600
sim       iss.exe     -r ..\..\src\osboot\sim\osboot
eb29K     montip.exe  -t eb29K -r ..\..\src\minimon\eb29K\mon.o
yarcrev8  montip.exe  -t yarcrev8 -r ..\..\src\minimon\yarcrev8\mon.os


The first field is again the session identifier. The second and third fields contain
the TIP executable file name and its start-up option switches. All DFEs have some
kind of command-line or interactive command which allows the session_id value to
be entered. The DFE then reads the UDI Configuration File to determine the TIP with
which communication is to be established. Most DFEs have a command (mondfe has
the disc command) which enables the DFE to disconnect from the TIP and cease to
execute while leaving the TIP running. Because the TIP is still alive and ready to service
UDI requests, a DFE can start up later and reconnect with the TIP. However,
typically the DFE and TIP processes are terminated at the same time.

7.5.7  The  UDI  -  Instruction  Set  Simulator  Connection,  ISSTIP 

An  Instruction  Set  Simulator,  isstip,  is  available  for  DOS  and  UNIX  type  hosts. 
The  isstip  process  is  fully  UDI  conforming  and  can  be  used  by  any  DFE.  Because  of 
existing  contract  limitations,  AMD  normally  ships  isstip  in  binary  rather  than  source 
form.  Using  the  simulator  along  with,  say  the  gdb  DFE,  is  a  convenient  and  powerful 
way  of  exercising  29K  code  without  ever  having  to  build  hardware.  Thus,  software 
engineers  can  use  the  simulator  while  a  project’s  hardware  is  still  being  debugged. 

The Instruction Set Simulator cannot be used for accurate application
benchmarking, as the system memory model cannot be incorporated into the simulation.
AMD supplies the architectural simulator, sim29, for that purpose (see Chapter 1).
The simulator supports several command line options, as shown below. For an
explanation of these options, enter isstip or man isstip on your TIP host machine.

isstip [-r osboot_file] [-29000|-29050|-29030|-29200] [-t] [-tm]
       [-id <0|1>] [-sp <0|1>] [-st <hexaddr>] [-ww] [-le] [-p|v]

With the -r option, the osboot_file is loaded into memory at address 0. This is
useful for installing operating systems like OS-boot before application code starts
executing. With processors which support separate Instruction and ROM memory
spaces, the osboot_file is loaded into ROM space. If the -r option is not used, the
simulator will intercept HIF service calls and perform the necessary operating system
support service. The simulator always intercepts HIF services with service numbers
255 and less regardless of the -r option. These HIF services are provided directly by
the simulator.
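
The interception rule just described reduces to a two-part test, written out below. The function name and its arguments are hypothetical; the sketch only restates the rule given in the text and says nothing about how isstip is implemented internally.

```c
/* Returns 1 when the simulator itself services the HIF call, 0 when the
 * call is left to the loaded osboot_file. osboot_loaded is nonzero when
 * the -r option was given. */
int iss_intercepts_hif(int service_number, int osboot_loaded)
{
    if (service_number <= 255)
        return 1;              /* always handled directly by the simulator */
    return !osboot_loaded;     /* otherwise only when no OS was loaded */
}
```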

The simulator is very useful for debugging Freeze mode code. It will allow
single stepping through Freeze mode code, which is not possible with a real processor
unless it supports Monitor mode. Freeze mode code is normally supplied in the
optional osboot_file. Thus, the -r option must be used to enable Freeze mode
debugging. Additionally, to enable debugging of Freeze mode timer interrupts the -tm
option must also be selected to enable timer interrupt simulation. The simulator
normally intercepts floating-point traps and performs the necessary calculation directly.
Simulation speeds are reduced if floating-point trapware is simulated. However, if
the trapware is to be debugged the -t option must be used to enable trapware
simulation.

When the isstip process is started it advertises its readiness to service UDI
requests. A DFE process will typically connect to the TIP process and a debug session
will commence. However, it is more typical to get the DFE to start the TIP. The
mondfe process starts the TIP during the DFE start-up process. The gdb DFE starts
the TIP after the target gdb command is used. The start-up isstip parameters are
taken from the “UDI Configuration File”. The format of this file is explained in the
previous section discussing mondfe.

7.5.8  UDI  Benefits 

A  number  of  debug  tool  developers  are  currently,  or  will  be  shortly,  offering 
tools  which  are  UDI  compliant.  Typically  the  DFEs  are  C  source  level  debuggers. 
This  is  not  surprising,  as  the  increased  use  of  RISC  processor  designs  has  resulted  in  a 
corresponding  increase  in  software  complexity.  The  use  of  a  high  level  language, 
such  as  C,  is  more  productive  than  developing  code  at  machine  instruction  level.  And 
further,  the  use  of  C  enables  much  greater  portability  of  code  among  current  and 
future  projects.  The  low  cost  of  GDB  makes  it  an  attractive  choice  for  developers. 

Target  processors  and  their  control  mechanisms  are  much  more  varied  than 
Debugger  Front  Ends  (DFEs).  I  have  briefly  described  the  MiniMON29K  TIP,  which 
is  a  process  which  controls  the  execution  of  a  29K  processor.  A  small  amount  of  code 
known  as  the  DebugCore  is  placed  in  processor  ROM  memory  and  enables 
examination  of  the  processor  state.  The  MiniMON29K  TIP  communicates  with  the 
DebugCore  via  a  hardware  link  which  is  specific  to  the  embedded  application 
hardware. 

Other  TIPs  already  exist  and  more  are  under  development.  There  is  a  29K 
simulator  (ISS)  which  runs  on  UNIX  and  DOS  hosts.  The  DFE  communicating  with 
the  simulator  TIP  is  unaware  that  the  29K  processor  is  not  present,  but  being 
simulated  by  a  process,  executing  on,  say,  a  UNIX  workstation.  There  are  also  tool 
developers  constructing  TIP  programs  to  control  processor  emulators.  This  will 
make  possible  a  top-of-the-line  debug  environment. 

UDI  makes  possible  a  wider  tool  choice  for  application  code  developers. 
Debugger  front  end  tools  are  supplied  separately  from  target  control  programs.  The 
user  can  consider  cost,  availability  and  functionality  when  selecting  the  debug 
environment.  This  level  of  debug  tool  configurability  has  not  been  available  to  the 
embedded  application  development  community  in  the  past. 

Because debuggers like GDB are available in source form, developers can add
additional  debug  commands,  such  as  examination  of  real-time  operating  system 
performance.  This  would  require  adding  operating  system  structural  information 
into  GDB.  When  the  debugger  front  end  and,  for  example,  an  emulator  interface 
module,  are  supplied  as  a  single  executable,  adding  new  commands  is  not  possible. 
Via  the  use  of  Internet  sockets  the  debugger  may  execute  on  a  different  networked 
host  than  the  node  supporting  the  emulator  control  process. 

7.5.9  Getting  Started  with  GDB 

To demonstrate the operation of GDB debugging a program running on an
Am29000 processor, the program below was compiled using the Free Software
Foundation’s GCC compiler. The example is simple, but it does help to understand
the GDB-MiniMON29K monitor debug mechanism. A stand-alone Am29000
processor development card was used. It contains a UART and space for RAM and
EPROM devices. The MiniMON29K monitor modules were linked with a HIF
operating system support module (OS-boot) and an Am85C30 UART message
driver module [AMD 1988]. The linked image was installed in EPROM devices in
the target hardware. A serial cable was then used to connect the UART to a port on a
SUN-3/80 workstation.

The demonstration could equally well have been performed on a
386-based IBM-PC, with the target hardware connected via a PC serial port.
Alternatively, there are a number of manufacturers building evaluation cards which
support a dual-ported memory located on a PC plug-in card containing the RISC
processor. The 386 communicates with the target processor via a shared memory
interface. This requires a TIP which can communicate via shared memory with the
DebugCore running on the target hardware. A number of such TIP control processes
have been built. A board developer has only to implement the TIP portion of the
debug mechanism to gain access to a number of debuggers, such as GDB, which are
UDI conforming. Note that, due to an implementation limitation of the current DOS
version of GDB, it is necessary to start the TIP process manually. GDB is unable to
automatically start montip or isstip. The command shown below must be used to
start montip on a DOS host before GDB can communicate with the target 29K
system.


montip  montip.exe 


The demonstration program, listed below, simply measures the number of
characters in the string supplied as a parameter to the main() function.

main(argc, argv)                /* program measure.c */
int argc;
char *argv[];
{
    int len;

    if (argc < 2) return;
    len = strlen(argv[1]);
    printf("length=%d\n", len);
}

int strlen(s)
char *s;
{
    int n;

    for (n = 0; *s != '\0'; s++)
        n++;
    return (n);
}


GDB was started running on the UNIX machine. The target command was used
to establish communication with the DebugCore running in the standalone
development card. The UDI Configuration File was used to establish DFE and TIP
communication. The format of the Configuration File was described in section 7.5.6. The
UDI session_id for the example shown is monitor. The list below presents the
response seen by the user. The keyboard entries made by the user are shown in bold
type.


gdb 

GDB  is  free  software  and  you  are  welcome  to  distribute  copies  of 
it  under  certain  conditions;  type  "show  copying"  to  see  the 
conditions.  There  is  absolutely  no  warranty  for  GDB;  type  "show 
warranty"  for  details.  GDB  4.5.2,  Copyright  1992  Free  Software 
Foundation,  Inc. 

(gdb)  target  udi  monitor  measure 

Remote  debugging  Am29000  rev  D  Remote  debugging  an  Am29000 
connected  via  UDI  socket,  DFE-IPC  version  1.2.1  TIP-IPC  version 
1.2.1  TIP  version  2.5.1  MONTIP  UDI  1.2  Conformant 

Once communication had been established, a breakpoint was set at the entry to
the strlen() function. Execution was then started using the run command. GDB
informs the user that the program is being loaded. This is accomplished by the TIP
sending messages to the debug core, which transfers the accompanying message data
into Am29000 processor memory before Am29000 processor execution
commences.


(gdb) symbol measure
Reading in symbols for measure.c...done.
(gdb) break strlen
Breakpoint 1 at 0x10200: file measure.c, line 14.
(gdb) run measure_my_length
Loading TEXT section at 0x10000 (24408 bytes) ...
Loading DATA section at 0x80003000 (4096 bytes) ...
Clearing BSS section at 0x80004000 (0 bytes) ...
Breakpoint 1, strlen (s=0x80004013 "measure_my_length")
(measure.c line 17)
17          for (n = 0; *s != '\0'; s++)

The  program  runs  until  the  requested  breakpoint  is  encountered.  At  this  point  a 
source  code  listing  was  requested.  Typically,  debug  monitors  do  not  allow  source 
code  to  be  viewed.  The  use  of  GDB  makes  this  important  advantage  available  to  the 
embedded  software  developer. 


(gdb) list
11
12      int strlen(s)
13      char *s;
14      {
15          int n;
16
17          for (n = 0; *s != '\0'; s++)
18              n++;
19          return (n);
20      }

The user then examined the call-stack history using the info stack command.
This is currently inefficiently implemented. GDB uses the to_xfer_memory()
procedure to send read messages to the target DebugCore. Examining the instruction
memory in this way is much less efficient than requesting the DebugCore to search
back through its own memory for procedural tag words. Each procedure has a
non-executable trace-back tag word, or two, placed before the first instruction of the
procedure (see Chapter 3). Tag words enable debuggers to quickly gain information
about a procedure frame, and hence variable values. Adding the procedural “hook” to
GDB to make use of the MiniMON29K monitor FIND service would greatly reduce
message traffic, and improve the user’s response time for the info stack command.

(gdb) info stack
#0  strlen (s=0x80004013 "measure_my_length") (measure.c line 17)
#1  0x101ac in main (argc=2, argv=0x80004000) (measure.c line 8)

GDB enables single stepping of source code with the step or next commands.
The listing shows a source-level step request followed by the printing of procedural
variables “n” and “s”. With large embedded programs it is important to be able to
debug at source-level, and examine variables without having to look at cross-listing
mapping tables to find the address associated with a variable’s memory location.
Typically, small embedded debug monitors do not support this kind of debugging.

(gdb) step
17          for (n = 0; *s != '\0'; s++)
(gdb) print n
$1 = 0
(gdb) print s
$2 = (unsigned char *) 0x80004013 "measure_my_length"

Embedded applications often deal with controlling special purpose hardware
devices. This may involve interrupt handlers and assembly-level code which
operates with processor registers reserved for the task. GDB does support examination of
assembly code and registers by name. The listing below shows disassembly from the
current PC location (PC1 on the Am29000 processor). The si command was then
used to single step at machine instruction level. The cont command caused execution
to continue to completion, as no further breakpoints were encountered.

The result of the printf() function call can finally be seen. This function relies on
the operating system making use of MiniMON29K monitor messages. The HIF-OS
write() system call, like the DebugCore, sends the required message to the host
processor. However, in the case of operating system messages, the message is not
normally sent to the GDB module but to the HIF-OS support module. An exception is
made in the case of a read() or write() to the standard-in or -out channel. Related
messages are relayed via UDI to GDB, which must control both the displaying of
received data on the screen and sharing the keyboard between the application and the
debugger itself.


(gdb) x/4i $pc
0x10228 <strlen+64>:    sub gr117,lr1,8
0x1022c <strlen+68>:    load 0,0x0,gr118,gr117
0x10230 <strlen+72>:    add gr118,gr118,1
0x10234 <strlen+76>:    store 0,0x0,gr118,gr117
(gdb) si
0x1022c 18              n++;
(gdb) p/x $pc
$3 = 0x0001022c
(gdb) cont
Continuing.
length=17

7.5.10  GDB  and  MiniMON29K  Summary

GDB is a powerful debug tool which can be applied to the problem of developing
software for embedded applications. The MiniMON29K monitor DebugCore
and message handling modules enable GDB to be simply incorporated in a wide
range of embedded systems. The MiniMON29K monitor has only a small memory
requirement and does not require processor registers to be reserved for its use.

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Users  are  free  to  incorporate  their  own  real-time  operating  system,  or  alterna¬ 
tively  make  use  of  the  HIF  operating  system  module.  Because  GDB  is  available  in 
source  form,  it  can  be  extended  to  understand  real-time  operating  system  support 
data  structures.  Purchasers  of  third  party  executives,  or  those  who  choose  to  build 
their  own,  should  not  find  it  difficult  to  extend  GDB  to  analyze  the  real-time  operat¬ 
ing  system  control  parameters,  via  the  Universal  Debugger  Interface  standard. 

The increasing complexity of the applications addressed by RISC processor
designs brings a corresponding increase in software complexity. The low cost of
GDB and its associated productivity make it an attractive choice for developers.


7.6  SIMPLIFYING  ASSEMBLY  CODE  DEBUG 

It would be ideal to have a whole chapter dedicated to the subject of Designing
for Debug. However, size constraints have restricted this section to a few hints about
how to better develop assembly code. Certainly those developing 29K based systems
should first consider the difficulties (if any) of connecting logic analyzers, ROM
emulators or in-circuit emulators to their designs before constructing any circuitry.
Tool suppliers as well as AMD support services and literature provide useful
information with regard to planning for debug. This information should be obtained
and studied at the early stages of a project.

When developing a program in a high level language such as C, the compiler can
be directed to provide the necessary debug information in the output object file (COFF
file). With the High C 29K compiler, as with most C compilers, the “-g” switch
requests this additional debug information. Source level debuggers, such as UDB or
GDB, need the additional information in order to correctly perform their task. Using
High C, it is possible to examine the assembly level directives which result from the
use of the “-g” compiler switch. For example, use the command
“hc29 -S -Hanno -g file.c” to produce a file called “file.s” which has high level
language debug directives embedded among the 29K assembly code.

When developing programs at assembly level it is best to include the high level
debug directives; too frequently assembly language developers omit this task.
Directives can be added to provide symbol-table and line number information for the
assembly files. This simplifies the task of later debugging the assembly code. For
example, the swaf utility can be used to read COFF files and produce an information
file in Hewlett Packard’s General Purpose ASCII (GPA) format. The GPA file can be
loaded into an HP16500B logic analyzer, enabling the analyzer to display symbol
information rather than, say, hex address values. Further, using HP’s B3740A Software
Analyzer product in conjunction with their logic analyzer, trace of source line
execution is possible if line number information has been provided.

It is best to use macro instructions to embed the required symbol-table and
line number information. The FUNC macro is used in the following example to
provide information about function dbg_w_glob() which is written in 29K assembler.

; Write absolute global registers with memory resident data.
; dbg_w_glob(dest_p, src_p);

        FUNC    _dbg_w_glob, __LINE__
        mtsr    IPA, lr2                ; IPA set to destination
        jmpi    lr0                     ; return
        load    0,0,gr0,lr3             ; read memory
        ENDFUNC _dbg_w_glob, __LINE__

Macro instruction ENDFUNC is used to mark the end of the function. Both
macros receive two parameters; the first is the name of the function, the second
provides line number information. Symbol __LINE__ is expanded by the C
pre-processor utility, cpp, which is available with most systems supporting Unix.
Note, users of High C 29K version 3.3 or newer will not need to use cpp as the
assembler directly supports the use of the __LINE__ symbol. When a file is processed
by cpp, the __LINE__ symbol is replaced by the current line number. Unfortunately,
cpp adds a line at the start of its output file which does not comply with 29K
assembler syntax. This line is simply removed using the tail Unix utility. In general,
to support line number expansion, command lines similar to the following three must
be added to a Unix makefile for each assembly source file.

cpp file.s > tmp.s          # run C pre-processor
tail +2 tmp.s > _file.s     # use Unix "tail" utility
as29 _file.s                # assemble file
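
The effect of the line number expansion can be sketched in a few lines of Python. This is only an illustration of what cpp does to the __LINE__ symbol, not a replacement for cpp; the function name expand_line_symbol is hypothetical and the sample text is the dbg_w_glob() example with its comments trimmed.

```python
import re

def expand_line_symbol(source):
    """Replace each __LINE__ token with the 1-based number of its line."""
    out = []
    for number, text in enumerate(source.splitlines(), start=1):
        # \b leaves identifiers that merely contain __LINE__ untouched
        out.append(re.sub(r"\b__LINE__\b", str(number), text))
    return "\n".join(out)

asm = "\n".join([
    "; dbg_w_glob(dest_p, src_p);",
    "        FUNC    _dbg_w_glob, __LINE__",
    "        mtsr    IPA, lr2",
    "        jmpi    lr0",
    "        load    0,0,gr0,lr3",
    "        ENDFUNC _dbg_w_glob, __LINE__",
])
expanded = expand_line_symbol(asm)
print(expanded)
```

In this sketch FUNC receives line 2 and ENDFUNC receives line 6, which is the information the macros place in the .line directives.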


The listing below shows the code used to implement the FUNC macro. The
return type of the function is “int” (T_INT). A tagword is provided but the field
details are not constructed.

        .macro  FUNC, fname, fline
        .def    fname           ; start symbol-table entry
        .val    fname           ; value of symbol = address
        .scl    2               ; storage class = C_EXT
        .type   0x24            ; type of symbol = T_INT and DT_FCN
        .endef                  ; end of symbol-table entry
        .word   0x0             ; Tag word
        .global fname
fname:
        .def    .bf             ; start symbol-table entry
        .val    .               ; value of symbol = PC address
        .scl    101             ; storage class = C_FCN
        .line   fline           ; source line number
        .endef                  ; end of symbol-table entry
        .ln     1               ; line number within new section
        .endm

The listing below shows the code used to implement the ENDFUNC macro.
There are a number of high level language support directives required to specify an
end of function. The comment fields explain the symbol-table definition used.

        .macro  ENDFUNC, fname, fline
        .def    .ef             ; start symbol-table entry
        .val    .               ; value of symbol = address
        .scl    101             ; storage class = C_FCN
        .line   fline           ; source line number
        .endef                  ; end of symbol-table entry
        .def    fname           ; start symbol-table entry
        .val    .               ; value of symbol = PC address
        .scl    -1              ; class = C_EFCN (func. end)
        .endef                  ; end of symbol-table entry
        .endm


Assembly macros can also be used to provide type information for data which is
defined in assembly level modules. Symbol-table information for data variables is
usually less useful than information about functions. However, if assembly level
directives are not used, then all data will appear to be of type “char” (T_CHAR). Some
debuggers may be confused by this and will not be able to correctly report which
symbol has been accessed during a load or store operation. The example below shows
how the INT_32 and INT_32_ARY macros can be used to define variables. In the
example, the variables are located in a BSS (un-initialized data) region. The macros
provide the high level language directives which result in the correct symbol-table
information.

        .sect   dbg_bss,bss
        .use    dbg_bss
        .align  4
        INT_32      _dbg_tmp_reg        ; 32-bit uninitialised data
        INT_32      _dbg_tmp_p          ; 4-byte data
        INT_32_ARY  _dbg_return, 8      ; 8 * 4-byte array

The listing below shows the code used to implement the INT_32 macro. The
symbol is of type “int” (T_INT). The enumeration (4) used for this type can be found
in the documentation supporting COFF. Alternatively, the C compiler can be run with
the “-g” switch and the output examined.

        .macro  INT_32, name
name:   .block  4
        .def    name            ; start symbol-table entry
        .val    name            ; value of symbol, address
        .scl    2               ; storage class, C_EXT
        .type   0x4             ; type of symbol, T_INT
        .endef                  ; end of symbol-table entry
        .endm

The INT_32_ARY macro is shown below. This macro is a little more complex
as it declares an array. The two macros shown here are useful but do not represent the
complete range of macros which would be required to describe all data types.
However, given these examples, it should not be difficult to construct any other macros
required.

        .macro  INT_32_ARY, name, size
name:   .block  4 * size
        .def    name            ; start symbol-table entry
        .val    name            ; value of symbol, address
        .scl    2               ; storage class, C_EXT
        .dim    size            ; dimension of array
        .type   0x34            ; symbol type = T_INT and DT_ARY
        .endef                  ; end of symbol-table entry
        .endm
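
The .type values used by the macros (0x4, 0x24 and 0x34) pack a basic type and a derived type into one number. The Python sketch below decodes them, assuming the usual COFF layout: the basic type in the low four bits and derived types in two-bit fields above it. The numeric values for T_CHAR, DT_PTR, DT_FCN and DT_ARY are taken from that common convention rather than from this text, so check them against the COFF documentation for the tool chain in use. The function name decode_type is hypothetical.

```python
# Subset of the common COFF type enumerations (assumed, see lead-in).
BASIC_TYPES = {2: "T_CHAR", 4: "T_INT"}
DERIVED_TYPES = {1: "DT_PTR", 2: "DT_FCN", 3: "DT_ARY"}

def decode_type(value):
    """Split a .type value into its derived types and its basic type."""
    names = []
    derived = value >> 4                 # derived types sit above the low 4 bits
    while derived:
        field = derived & 0x3            # two bits per derived type
        names.append(DERIVED_TYPES.get(field, "derived type %d" % field))
        derived >>= 2
    basic = value & 0xF
    names.append(BASIC_TYPES.get(basic, "basic type %d" % basic))
    return names

for value in (0x4, 0x24, 0x34):
    print(hex(value), decode_type(value))
```

Under these assumptions 0x24 decodes to a function returning “int” and 0x34 to an array of “int”, which agrees with the comments in the FUNC and INT_32_ARY listings.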


7.7  SOURCE  LEVEL  DEBUGGING  USING  A  WINDOW  INTERFACE 

There  are  a  number  of  source  level  debuggers  available  for  the  29K  family 
which  support  a  windowed  user  interface;  primarily  windowed-gdb,  xray29u  and 
UDB.  A  windowed  debugger  is  appealing  to  many  development  engineers  because 
of  its  convenient  interface  and  potentially  greater  productivity.  For  the  benefit  of 
those  engineers  involved  in  embedded  processor  development  who  have  not  yet  had 
the  opportunity  to  experience  a  windowed  debugger,  this  section  gives  a  brief 
introduction  to  the  topic.  For  illustration  purposes,  the  UDB  universal  source  level 
debugger  is  used. 

UDB  was  specifically  designed  for  embedded  software  development. 
Consequently,  UDB  provides  a  Generic  I/O  (GIO)  interface  alternative  to  ptrace() 
for  communicating  with  the  target  29K  system.  The  GIO  code  runs  as  a  separate 
process  from  the  UDB  process.  The  two  processes  communicate  via  a  socket 
connection  on  Unix  hosts.  This  enables  the  GIO  process  to  be  provided  in  source 
form  without  having  to  make  UDB  source  code  available.  A  UDI  conformant  version 
of  UDB  is  available  for  Unix  hosts.  This  was  achieved  by  interfacing  the  GIO 
interface  to  the  UDI-p  library,  see  Figure  7-10.  Currently,  a  UDI  interface  for  PC 
Windows  is  under  development,  and  this  will  lead  to  a  UDI  conformant  version  of 
UDB  for  PC  Windows.  CaseTools  Inc.,  the  developer  of  UDB,  has  a  GIO 
implementation  available  which  is  not  interfaced  to  the  UDI  standard,  but 
communicates  directly  with  a  CaseTools  maintained  monitor  known  as  UMON. 
Currently, UDB for Windows operates with UMON rather than the DebugCore
which is supplied as part of the MiniMON29K bundle. This will be the position until
the UDI for PC Windows specification has been completed.

When a UDI conformant version of UDB is used with a 29K evaluation board,
establishing UDB operation is very simple. The mktarget command in the udb.rc
start-up file is used to start a GIO process which supports the UDI interface. The
GIO process uses the assigned mktarget parameters to select the entry in the udi_soc
file which is then used to establish the DFE-TIP connection (see section 7.9.2). In
this way, it is particularly convenient to use UDB with the instruction set simulator
ISSTIP. Similarly, UDB can be configured to connect to MonTIP which itself
communicates, for example, via a serial link or NetROM (ROM emulator), to the
DebugCore running in the 29K system.

Using UDB with UMON rather than MiniMON29K is also simple when the
29K evaluation board has UMON installed and running. CaseTools recommends that
UMON first be linked with 29K boot-up code known as boot-crt0 and then installed
in, say, ROM on the target 29K system. A CaseTools customer is required to
construct their own boot-crt0 code. This could be accomplished using the OS-boot
code provided by AMD.

However, because AMD provides 29K evaluation boards with MiniMON29K
already installed, some developers and evaluators may wish to run UMON without
first constructing a boot-crt0. The UMON monitor can be run on top of the
MiniMON29K DebugCore, and where necessary the application can make use of
services provided by OS-boot in place of the missing boot-crt0. A number of
preparation steps must be taken to make this tool combination operate correctly.
These steps are explained below. It is important to remember these steps only apply
when launching UMON from MiniMON29K. No special linking and loading steps
are required if UDB is used directly with MiniMON29K or directly with UMON
combined with an appropriate boot-crt0.


Figure 7-10. The UDB to 29K Connection via the GIO Process

MiniMON29K-UMON Differences

When using UMON rather than MiniMON29K components, a different crt0 file
must be linked with application code (the crt0 file linked with application code
should not be confused with the boot-crt0 file linked with UMON). There are a
number of reasons for this; for example, MiniMON29K normally clears the application
BSS data region when a new program is prepared for execution. UMON does not
clear the BSS region and hence this task must now be performed by the crt0 file
linked with the application. Normally the default crt0 file provided with the compiler
is linked ahead of any application code.

UMON expects OS-boot to satisfy HIF service requests, at least for service
numbers 256 and greater. However, service numbers 255 and lower require OS-boot
to request help from MonTIP. For this purpose, OS-boot exchanges MiniMON29K
messages with MonTIP. When the DebugCore is replaced by UMON, the
MiniMON29K message system is also replaced by the UMON communication
mechanism. Hence, OS-boot cannot be used to perform HIF services with service
number 255 or lower.
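
The split between the two ranges of HIF service numbers can be summarized with a small Python sketch. It only restates the rule given above; the function name hif_service_path and the returned strings are hypothetical labels, not part of any AMD or CaseTools interface.

```python
def hif_service_path(service_number, monitor):
    """Say who satisfies an HIF request under MiniMON29K or UMON."""
    if service_number >= 256:
        return "OS-boot"                       # handled on the target
    if monitor == "MiniMON29K":
        return "OS-boot via MonTIP"            # needs MiniMON29K messages
    if monitor == "UMON":
        return "unavailable through OS-boot"   # message system was replaced
    raise ValueError("unknown monitor: " + monitor)

for number in (20, 256):
    for monitor in ("MiniMON29K", "UMON"):
        print(number, monitor, "->", hif_service_path(number, monitor))
```

Service number 20 (write) falls in the lower range, which is why it needs separate support when UMON is in use.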

To support HIF services such as write (service number 20), which is used by the
printf() library routine, UDB is provided with a library which supplies routines, such
as _write(), which interface to the UMON communication mechanism. This library,
libudb.lib, must be linked with application code. The libudb.lib library must be
linked ahead of the default libraries supplied with the compiler, as the default
libraries also contain system call glue-routines such as _write(); these now
unwanted routines request HIF services supported by OS-boot.
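
Why the link order matters can be seen in a simplified Python model in which the first library, in link order, that defines a symbol supplies it. Real linkers are more involved, and the library contents listed here are hypothetical; only the name libudb.lib and the routine _write() come from the text.

```python
def resolve(symbol, libraries):
    """Return the name of the first library, in link order, defining symbol."""
    for name, symbols in libraries:
        if symbol in symbols:
            return name
    return None

# Hypothetical library contents, for illustration only.
default_lib = ("default C library", {"_write", "_read", "_printf"})
libudb = ("libudb.lib", {"_write", "_read"})

print(resolve("_write", [libudb, default_lib]))   # libudb.lib
print(resolve("_write", [default_lib, libudb]))   # default C library
```

With libudb.lib first, _write() is taken from the UMON-aware library; with the order reversed, the glue-routine that requests HIF services from OS-boot would be linked instead.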

Compiling  a  Program  for  UMON  Debugging 

When UMON is launched from MiniMON29K, care must be taken when
building application programs for debug. A makefile for driving the High C 29K
compiler is provided with UMON as a template for building application programs.
The makefile builds a link-command-file and ensures the correct files are linked in
the correct order. The “APP=” line at the top of the makefile must be modified to
change the application program being built.

Alternatively, the compiler can be driven directly from the command line. First,
alternative files must be copied from the UMON installation to the High C 29K
installation directories. Copy the file /udb/apps/crt0.o to /29k/lib/udb_crt0.o and file
/udb/apps/libudb.lib to /29k/lib/libudb.lib. These files are referenced by the
command file /udb/apps/udb.cmd which should also be copied to /29k/lib/udb.cmd. File
fib.c can then be compiled with the command:

hc29 -g -o fib.abs -nocrt0 -cmdudb.cmd fib.c

The “-nocrt0” option suppresses linking of the default /29k/lib/crt0.o file. The
udb.cmd file is configured to link programs starting at memory location
0x40040000. This is suitable for use with an SA29200 evaluation board. The address
is higher than usual because UMON is also installed in the 29K’s program memory.

Figure 7-11. UDB Main Window Showing Source Code Frame

UDB does not load symbol information directly from 29K COFF files. The
utility mksym (see section 2.6.1) must be used to build a symbol file in a format
understood by UDB. The command below builds a symbol file for the fib.abs COFF
file compiled earlier. It is convenient to place the mksym command in the build
makefile.

mksym  fib.abs  fib.sym 
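
The compile and mksym steps always occur as a pair, so a build script can generate both from the source file name. The Python sketch below only assembles the two command lines quoted in this section; the helper name build_commands is hypothetical, and the “-nocrt0” spelling assumes the option is named after the crt0.o file.

```python
def build_commands(source):
    """Return the compile and symbol-file commands for one C source file."""
    stem = source.rsplit(".", 1)[0]
    return [
        ["hc29", "-g", "-o", stem + ".abs", "-nocrt0", "-cmdudb.cmd", source],
        ["mksym", stem + ".abs", stem + ".sym"],
    ]

for command in build_commands("fib.c"):
    print(" ".join(command))
```

Each list could be handed to subprocess.run() on a host where the tools are installed; here the commands are only printed.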

Preparing  for  UMON  Debugging 

When MiniMON29K is used to launch UMON, the UMON monitor must first
be installed in the target system memory. Once installed, control of the processor is
passed from the DebugCore to UMON. MonDFE can be used to install UMON.
Assuming the udi_soc file (udiconfs.txt for PC hosts) has an entry “serial” for
establishing operation with a 29K target board, MonDFE can be started with the command:

mondfe  -D  -TIP  serial. 

Once  MonDFE  is  started,  the  UMON  program  can  be  loaded  and  execution 
started.  At  this  stage  MiniMON29K  including  MonTIP  is  no  longer  needed.  The 
MonDFE  command  sequence  below  is  all  that  is  needed  to  get  UMON  running. 

y /udb/umon/sa200/sa200x.abs

g 

q 

Debugging  a  program 

As noted previously, if MiniMON29K is running on the target 29K system, a
version of UDB can be started which utilizes a UDI conformant GIO. If UMON is
running on the 29K target system, a non-UDI conformant UDB should be started.
Double clicking on the UDB29K icon or starting UDB program execution from the
command line will establish a connection with the debug monitor (assuming the udb.rc
command file contains a mktarget command). The file udb.rc is read by UDB during
the start-up sequence. It can be used to customize UDB operation.

Once UDB has started, a 29K application program can be loaded. This is done
by using the upper left menu item File-load-Symbols & Executable. Then use the
menu item Execute-Run until, and enter the label “main” when prompted for an
address; the source file will then be displayed in the window (for example, see
Figure 7-11). Once a window has been created it can be used to display different
display frames. In Figure 7-11 a source code frame is displayed in the newly created
window. In general, any window can display any frame type. The following
discussion does not strictly adhere to the correct terminology for frames and
windows. In particular, where it is convenient, the term “window” may be used to
refer to a “frame” within a given window.

It is usually necessary to pop-up a window displaying a Console frame to enable
program input/output. This can be done by clicking on the Con button (lower right)
while holding down the shift key. A Console window will appear which enables
application input/output information to be displayed. The keyboard echo option must
be enabled by first clicking the right mouse button while in the Console window. The
left button can then be used to select “echo on” from the provided menu. Program
execution will continue when the Go button (top middle) is pressed.

The method used to pop-up a Console window can also be used to pop-up a
range of other debug support windows. The Asm button produces a window
displaying assembly level code. The assembly code window shown in Figure 7-12 was
produced by clicking the Asm button. The highlighted code line (with arrow)
corresponds to the current Program Counter (PC) position. The corresponding source line
was highlighted in the source code window (Figure 7-11). All windows whose mode
setting (Source or Code mode) displays the current code position are updated
automatically when the PC changes. A new PC value is reported whenever program
execution stops due to, say, single-stepping a source or assembly code line, or hitting
a breakpoint. The example windows have a breakpoint set at the first line of the fib()
function. The right mouse button can be used to select the current display mode for a
window.

This section is too brief to fully describe the capabilities of UDB. Developers
typically pop-up a number of windows displaying code, memory and register
contents. Windows can be selected and arranged in a way suited to an individual
developer or project’s requirements. As a further example, Figure 7-13 shows a window
displaying global register contents. The window is updated whenever a register value
changes. A new value can be entered into a register by placing the cursor over the
selected register data value and typing in the new value. The right mouse button can
be used to select other types of registers for displaying. For more information about
UDB commands, consult the UDB User’s Guide.

7.8  TRACING  PROGRAM  EXECUTION 

Tracing program execution refers to recording the instruction execution and
data accesses performed by a processor. Programs are normally traced up to a
breakpoint or other event causing normal instruction execution to halt. A software
engineer can examine the trace information and determine the program’s operation prior
to the event. The technique provides the software developer with a powerful tool for
eliminating software bugs.

A tracing capability is normally provided by an In-Circuit Emulator (ICE) or
logic analyzer. The task of tracing is complicated by on-chip instruction and data
caches. Without caches, processor activity is fully visible from the memory interface.
When an access is performed to on-chip cache, it is not normally possible to
determine the address or the data accessed. ICE developers can overcome this problem,
but often at increased tool cost. Those using a logic analyzer to perform tracing are
traditionally limited to debugging with caches turned off; or if caches are enabled, not
being able to observe all of a program’s execution. Embedded systems typically have
to meet stringent timing requirements and consequently it is not usually possible to
turn off caches. It is unfortunate that the use of a logic analyzer is restricted. Logic
analyzers are not processor specific and are universally used by hardware
development engineers; they are frequently available to the software engineer working on
embedded product development.

Figure 7-13. UDB Window Showing Global Registers

The  29K  family  helps  overcome  the  problem  of  tracing  while  caches  are  turned 
on  by  employing  Traceable  Cache  technology.  The  2-bus  microprocessors  and  the 
high-end  microcontrollers  support  traceable  caching.  Later  in  this  section,  traceable 
cache  operation  is  described  in  detail  for  the  Am29040  processor. 

This section deals with the use of a logic analyzer as a software debugging tool.
As well as describing the problem in general terms, specific material is included
relating to the use of Hewlett-Packard (HP) logic analyzers. HP was chosen because
their logic analyzers are popular, and many of the accompanying support tools have
been adapted to operate with HP analyzers. Other logic analyzer manufacturers and
their partners have also developed tool combinations which support source level
software debugging.

Logic Analyzer Connection

Many of the evaluation boards offered by AMD contain sockets suitable for
quick connection of a Hewlett-Packard logic analyzer. This simplifies the process of
connecting the analyzer to the processor’s signal pins. Certain other logic analyzer
manufacturers support a compatible termination adapter (pod) format. Logic
analyzers such as the HP16500B (system) and the HP166x series connect directly to the
evaluation board connectors. This is convenient, as connecting to devices in
packages other than PGA can be cumbersome and unreliable. Connection to your own
board can be achieved via a logic analyzer preprocessor: a preprocessor consists of a
small circuit board which connects directly into the processor socket (possibly with
the aid of a socket extender). A replacement processor is located on the board along
with an array of analyzer connection sockets. Corelis Inc. supplies preprocessors for
microprocessors and microcontrollers in the 29K family.

Microcontroller members of the 29K family incorporate on-chip memory
interface controllers. This results in the microcontroller providing RAS and CAS address
information separately (multiplexed on the same address pins) rather than a complete
DRAM address value. Consequently, it is necessary to latch the RAS address
information and later combine the CAS address bits to produce a complete DRAM
address. If the address latching technique is not used, then the logic analyzer cannot
display the complete address used for a DRAM access. This is very inconvenient. For
this reason, AMD provides address latching circuitry on its more recent
microcontroller evaluation boards. Corelis also provides address latches on its preprocessors.
The active components on the preprocessor draw power from the pins supplying
power to the processor.
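
The address latching technique amounts to holding the row value until the column value appears and then joining the two. The Python sketch below shows the arithmetic, assuming the row bits sit directly above the column bits and a 10-bit column field; both are illustrative assumptions, since the real mapping depends on the DRAM devices and the memory controller configuration. The function name dram_address is hypothetical.

```python
def dram_address(row, column, column_bits):
    """Combine a latched RAS (row) value with the CAS (column) value."""
    if column >> column_bits:
        raise ValueError("column does not fit in column_bits")
    return (row << column_bits) | column

latched_row = 0x1A5     # captured when RAS is asserted
column = 0x03C          # seen later, when CAS is asserted
print(hex(dram_address(latched_row, column, column_bits=10)))   # 0x6943c
```

Without the latched row value only the column bits would be visible at CAS time, which is why an analyzer connected to an unlatched bus shows an incomplete address.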

A  Logic  Analyzer  as  a  Software  Development  Tool 

Logic analyzers can be used to study a circuit’s state and timing information.
Hardware engineers typically display state information in hexadecimal or binary
format (see Figure 7-14). Software developers need a format which is more relevant
to their task. To this end, Corelis provides a tool which runs on the logic analyzer and
enables the processor bus signals to be displayed in assembly instruction format. The
tool is used in conjunction with a configuration file which formats the analyzer to the
assigned preprocessor signals. (For example, file POD_040._D for the Am29040
preprocessor.) When the configuration file is used, the task of first assigning labels to
the termination connector signals is eliminated. When the inverse assembler tool is
used, the DATA label shown on Figure 7-14 can optionally be displayed in terms of
29K assembly instructions rather than the hexadecimal equivalent.

Figure 7-14. HP16500B Logic Analyzer Window Showing State Listing

Using the swaf utility described in section 2.6.1, it is possible to display the
ADDR label shown on Figure 7-14 in terms of program address labels. The utility
builds a GPA formatted symbol file from information extracted from a linked COFF
file. The GPA file must be transferred to the analyzer; this is best done using a LAN
connection.
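
Displaying an address in terms of program labels means finding the nearest symbol at or below the address and reporting the offset from it. The Python sketch below shows that lookup with a plain dictionary; it does not read or write the GPA format, the function name symbolize is hypothetical, and the symbol addresses are illustrative (the strlen value is chosen so that 0x10228 appears as strlen+64, as in the GDB session earlier in this chapter).

```python
import bisect

def symbolize(address, symbols):
    """Format address as label+offset using the nearest preceding symbol."""
    table = sorted((value, name) for name, value in symbols.items())
    starts = [value for value, _ in table]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return hex(address)              # below the first known symbol
    base, name = table[index]
    offset = address - base
    return name if offset == 0 else "%s+%d" % (name, offset)

symbols = {"main": 0x10000, "strlen": 0x101E8}   # illustrative values
for address in (0x10000, 0x10228, 0x1022C):
    print(hex(address), symbolize(address, symbols))
```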

Hewlett-Packard has further extended the ability of their analyzers to support
source level debugging. Their B3740A Software Analyzer tool enables trace
information to be displayed at source level. The tool runs on a Unix workstation or on a
PC running Windows. An HP16500B logic analyzer must be connected to a
computer system via an HP16500L LAN card. Once the analyzer is connected to the LAN, it
can be controlled from the workstation or PC. For example, an X-terminal connected
to a computer running Unix can use the Software Analyzer tool to display program
trace information in terms of the original C code. For convenience it is also possible
to display the equivalent assembly level trace normally presented on the dedicated
analyzer display. Analyzer trigger logic can be set from the X-terminal and is
presented in terms of address symbols rather than hexadecimal values. The Software
Analyzer tool currently runs with HP16550A (6-pod, 102 channel) and HP16555A
(4-pod, 68 channel) analyzer cards which can be installed in the HP16500B logic
analyzer system.

Hewlett-Packard’s Software Analyzer tool is very useful; however, it displays
trace information corresponding to instruction fetch activity rather than instruction
execution. Not all fetched instructions which are observed on the system bus flow
through the processor pipeline and are executed. Instructions can be fetched due to
cache block reload or instruction fetch-ahead. The Software Analyzer indicates that
these instructions have been fetched for execution in the same way as instructions
which really are executed; the two cases are indistinguishable. This
problem is overcome when the logic analyzer is driven by the MonTIP program. The
UDI conformant MonTIP has been extended to include support for the HP16500B
logic analyzer. Algorithms have been incorporated within MonTIP for processing
trace information. These algorithms, described in more detail later, are able to
eliminate unwanted trace information and consequently produce trace data which
corresponds to the execution path taken by the processor.

Traceable  Caching 

Traceable Caching is accomplished using two processors in tandem: a main
processor and a slave processor. The two processors are connected together,
pin-to-pin, except that the slave uses its address bus and a few other signal pins
to indicate cache hit activity. The main processor performs all the required
operations, and the slave (tracing) processor duplicates the operation of the main
processor except that its output pins connected in parallel are disabled. All
processor outputs to the system are driven by the main processor. The slave
processor simply latches the results of the accesses performed by the main
processor.

With the Am29040 processor, the address bus A31-A0 of the slave (the tracing
processor), along with output pins REQ, R/_W and I/_D, reports physical branch
addresses even if the target instruction is provided by the on-chip instruction
cache. By tracing the slave processor signals along with the master, it is possible
to exactly reconstruct the sequence of instructions executed. Instruction execution
is considered consecutive until a further nonsequential event (such as a branch or
an interrupt) is reported by the slave processor.
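
The reconstruction rule described above can be illustrated with a short Python
sketch. It is not MonTIP’s implementation: the function name, the dictionary of
reported events, and the simplification that each event is already aligned to a
position in the executed instruction stream are assumptions made for illustration.
The sketch relies only on two points: execution is treated as consecutive between
reported nonsequential events, and each 29K instruction occupies one 32-bit word.

```python
INSTRUCTION_BYTES = 4  # each 29K instruction is one 32-bit word


def reconstruct_addresses(start, count, nonsequential):
    """Return the addresses of `count` executed instructions.

    `nonsequential` is a hypothetical mapping from a position in the
    executed stream (0-based) to the address reported by the slave for
    that position. Between reported events, execution is assumed to be
    consecutive.
    """
    addresses = []
    pc = start
    for index in range(count):
        if index in nonsequential:
            pc = nonsequential[index]  # branch or interrupt target reported
        addresses.append(pc)
        pc += INSTRUCTION_BYTES
    return addresses


flow = reconstruct_addresses(0x10000, 5, {3: 0x10040})
print([hex(a) for a in flow])
```

Here the fourth executed instruction is the reported target 0x10040, so the
sequence continues from that address rather than from 0x1000c.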

When a load or store hits in the data cache, the Am29040 slave processor
provides the corresponding physical address on its address bus. The slave also
indicates when a data access results in cache block allocation. When an instruction
executes, the corresponding processor status (signals STAT2-STAT0) is reported on
the following cycle, when the instruction is in the write-back pipeline stage. Load
and store instructions are reported in the same way as other instructions, at the
write-back stage, rather than when the actual data transfer is accomplished.

The Am29040 and Am29030 processors perform traceable caching at the internal
speed of the processor, which may be twice the speed of the off-chip memory
system. This ensures that the processor operation can always be fully reported. The
Am29240 microcontroller performs traceable caching at the off-chip memory system
speed. This can lead to difficulties when the processor is running internally at
twice the memory system speed. For example, it is not possible to report the target
address of the first jump in a back-to-back sequence of jump instructions
(instruction visiting). Only the target of the second jump is reported by the
Am29240 slave processor. Additionally, if a branch instruction executes in the same
memory cycle as a load or store instruction, the slave only reports the address of
the branch instruction.

Traceable caching is enabled via the JTAG interface. Boundary-scan instructions
for enabling or disabling tracing can be entered via the JTAG port. Corelis Inc.
manufactures preprocessor boards supporting traceable caching. The preprocessor
contains two processors: a master and a slave. The second processor is switched
into slave-type operation during reset. Active components on the preprocessor board
drive a TRACECACHE instruction into the slave processor. Around the perimeter of
the Am29040 preprocessor are nine logic analyzer connectors. An unusually large
number of analyzer pods is required due to the need to trace both master and slave
operation. In the Am29040 case, it is possible to operate with a minimum of eight
pods if the optional connector J7 is eliminated. This enables tracing to be
accomplished with a minimum of two HP16550A or two HP16555A logic analyzer cards
inserted into an HP16500B analyzer system.

The analyzer cards should be connected together in master and slave mode. This
requires physically connecting ribbon cables on the cards. The cards can be placed
anywhere in the HP16500B card cage, as MonTIP scans for their actual location.
Assuming two HP16550A cards are located in slots D and E, pod E1 (slot E) should be
connected to position J1 on the Corelis preprocessor, pod E2 to J2, and so on. Pods
D1-D3 must be connected to J7-J9. The POD_040._D file formats the D and E analyzer
cards for this configuration. The file, supplied by AMD or Corelis, is normally
located on the HP16500B at /AMD/CONFIG/POD_040._D. Note that a different
POD_040._D file is required if HP16555A cards are used rather than the lower cost
HP16550A cards. The HP16550A card has a 4K sample memory depth (at full channel
width); the HP16555A can store 1024K samples.

Addendum to - Evaluating and Programming the 29K RISC Family
October 13 1995, Draft 1

Processing  Trace  Information 

Enhancing MonTIP to control operation of the HP16500B logic analyzer offers a
number of advantages to the software engineer. It enables a UDI conformant debugger
to access the analyzer. This makes the analyzer usable with a range of different
Debugger Front Ends (DFEs), such as UDB or xray29u. It also enables trace
information to be processed before it is presented to the DFE. It is desirable that
only the executed instruction path be included in the trace data. This, after all,
is what software developers expect, given their previous experience using
In-Circuit Emulators (ICE). A further advantage is that an analyzer can be combined
with other UDI conformant debug tools to produce a debug environment similar to
that achieved with an ICE.

The MonTIP program controls the logic analyzer and processes trace information.
The same MonTIP can also control the target 29K system via commands sent to a
MiniMON29K DebugCore. The operation of MonTIP is directed by the chosen DFE. The
user enters commands to the executing DFE program. When the DFE is started it
typically initiates the operation of MonTIP. When started, MonTIP establishes
communication with the DebugCore and, via a LAN, the HP16500B logic analyzer. The
DFE user interface will appear on the display, along with the HP16500B user
interface which is requested by MonTIP. In addition to entering DFE commands, it is
possible to enter HP16500B commands directly into the logic analyzer window. Note
that a colour terminal simplifies the process of entering analyzer commands.

Using the logic analyzer window, unprocessed analyzer trace can be viewed. This
is a tedious task, particularly when the 29K processor is operating with its
on-chip caches turned on. The DFE can also be used to display analyzer trace
information, but this time in a fully processed format. Only instructions which
actually execute are reported in the trace listing.

The format of the displayed processed trace is dependent on whether the DFE has
been extended to display trace information. If a DFE has not been enhanced to
display trace in, say, source format, then the DFE must rely on MonTIP’s ability to
prepare trace information for display; this is achieved using a transparent-mode of
operation, which is described shortly. Bus signals selected for display in the
processed trace must be included in the format for unprocessed (raw) trace.
However, they need not actually appear in the analyzer state listing window.

A processed trace line contains the instruction which was in write-back during
the captured trace cycle, or data which was transferred during the cycle. Put
another way, if an instruction is fetched from memory, then during its write-back
cycle (if it reaches execute) the op-code is reported in the processed trace. Let’s
look at the algorithm used with the Am29040 Traceable Cache preprocessor. The DATA
and ADDR labels have their values changed by the algorithm to reflect the
instruction which was executed during the traced cycle. The DATA and ADDR labels in
the raw trace indicate the instruction which was fetched during the traced cycle or
data which was accessed during the cycle. If no data access or instruction
execution occurs in a cycle, then there is no processed trace line corresponding to
the raw trace line. MonTIP only reports lines which are considered valid.

The algorithm operates in two stages: first data accesses are processed, then
instruction flow is determined. Data accesses are examined to determine if there
are any repeat accesses reported due to the use of Scalable Clocking. Trace
information is captured at the processor’s internal speed. The memory system may be
running at half this speed. Consequently, accesses to memory are captured twice in
adjacent trace cycles. Only the final access is considered valid.
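
The repeat-access filter of the first stage can be sketched as follows. This is
an illustration only, not MonTIP source: the function name and the tuple layout of
an access are hypothetical, and the sketch assumes that a repeat shows up as an
identical access in the immediately following trace cycle.

```python
def mark_final_accesses(cycles):
    """Flag which captured cycles hold a valid (final) data access.

    `cycles` is a list with one entry per captured trace cycle: either
    None (no data access) or a hashable description of the access, here
    a hypothetical (kind, address) tuple. An access repeated in the next
    cycle is treated as a Scalable Clocking duplicate, so only the last
    copy of each run is marked valid.
    """
    valid = []
    for i, access in enumerate(cycles):
        if access is None:
            valid.append(False)
            continue
        repeated_next = i + 1 < len(cycles) and cycles[i + 1] == access
        valid.append(not repeated_next)
    return valid


captured = [("load", 0x8000), ("load", 0x8000), None,
            ("store", 0x8004), ("store", 0x8004)]
print(mark_final_accesses(captured))
```

A limitation of this simple comparison is that two genuinely separate but
identical accesses in adjacent cycles cannot be told apart from a duplicate.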

Data transfer, due to a load or store instruction, can occur during the same
cycle in which another instruction is executed. When this happens, the algorithm
moves the reporting of the data access to a future trace cycle which contains no
valid trace information. If another data transfer occurs before the previous one is
reported, then the previous data value will not be reported. The R/_W and I/_D
information is repositioned where necessary and possible, so as to report data
accesses which occurred. Note that LOADM and STOREM data transfers are reported
before the instruction execution is reported; this reflects the correct operation
of a 29K processor. Currently, the algorithm is being enhanced to enable multiple
instruction executions or data accesses to be reported on different processed trace
lines which correspond to the same captured trace cycle. This eliminates the need
to reposition or drop data access reports. These algorithm enhancements are
required by superscalar processors.

Figure 7-15. Path Taken By Am29040 Recursive Trace Processing Algorithm

Data accesses which generate a cache hit are given the same treatment applied to
memory-resident data accesses. When a data transfer occurs, it will be reported in
the next available cycle (which could be the current one) that is not being used to
report an instruction’s execution or other valid data access. When the data cache
is turned on, it will not always be possible to report the value of the data
transferred. The slave processor does not provide the cached data value, only the
address.
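
The repositioning of data accesses described above can be sketched in Python.
This is a simplified model under stated assumptions, not the MonTIP algorithm: the
function name and the (instruction, data) pair used for each cycle are
hypothetical, one report line is produced per captured cycle, and the special
ordering of LOADM and STOREM transfers is not modelled.

```python
def schedule_reports(cycles):
    """Decide what each processed trace cycle reports.

    `cycles` is a list of (instruction, data) pairs, one per captured
    cycle; either element may be None. An instruction always takes the
    cycle it executed in. A data access waits for the next cycle that
    has no instruction to report (possibly the current one). If a newer
    data access arrives first, the older unreported one is dropped.
    """
    reports = []
    pending = None  # data access still waiting for a free cycle
    for instruction, data in cycles:
        if data is not None:
            pending = data  # replaces any earlier unreported access
        if instruction is not None:
            reports.append(("I", instruction))
        elif pending is not None:
            reports.append(("D", pending))
            pending = None
        else:
            reports.append(None)  # no valid line for this cycle
    return reports


trace = [("add", None), ("load", 0x11), (None, None),
         ("sub", 0x22), ("jmp", 0x33), (None, None)]
for line in schedule_reports(trace):
    print(line)
```

In this example the access 0x11 is moved to the following idle cycle, while
0x22 is never reported because 0x33 arrives before a free cycle is found.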

For  vector  fetches,  the  vector  fetch  and  the  address  of  the  first  instruction  as  well 
as  the  VECT  status  are  reported  on  the  same  processed  trace  line. 

The second stage of the algorithm is a little more complicated; it produces the
complete address-flow for executed code. It currently only operates with 32-bit
memory accesses. Programs should not be traced when executing from 8-bit memory
devices. A recursive algorithm determines consecutive instruction execution
sequences, as shown in Figure 7-15. The algorithm starts with a branch instruction
and stops when it reaches a delay-slot instruction. Branch instructions initiate
new instruction sequences for the algorithm to recursively process.

Once  the  address  flow  is  determined,  a  second  recursive  routine  determines  the 
instructions  which  correspond  to  the  address  flow.  Often  these  instructions  are 
fetched  from  memory  and  can  be  found  in  the  DATA  field  of  a  previous  trace  cycle. 
However,  if  the  instruction  is  supplied  by  the  instruction  cache  then  XXXXXXXX  is 
entered  into  the  DATA  column.  If  an  address  value  lies  in  the  loaded  TEXT  region 
and  the  DATA  column  is  marked  XXXXXXXX,  then  the  op-code  is  obtained  from 
the  loaded  COFF  file  and  placed  in  the  DATA  field. 
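
The op-code substitution step can be sketched as below. It is a minimal
illustration, not MonTIP code: the function name is hypothetical, `text_image` is
an assumed dictionary standing in for the TEXT sections of the loaded COFF file,
and the op-code values are placeholders rather than real 29K instructions.

```python
CACHED = "XXXXXXXX"  # marker used when the instruction came from the cache


def fill_cached_opcodes(trace, text_image):
    """Replace cache markers with op-codes taken from the loaded program.

    `trace` is a list of (address, data) pairs where data is an
    8-character hexadecimal string or the CACHED marker. `text_image`
    maps TEXT-region addresses to 32-bit op-codes. Addresses outside
    the loaded TEXT region keep the marker.
    """
    filled = []
    for address, data in trace:
        if data == CACHED and address in text_image:
            data = f"{text_image[address]:08X}"
        filled.append((address, data))
    return filled


text_image = {0x10000: 0x12345678}          # placeholder op-code
trace = [(0x10000, CACHED), (0x20000, CACHED), (0x10004, "0BADF00D")]
print(fill_cached_opcodes(trace, text_image))
```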

MonTIP  Commands 

Strictly  speaking,  commands  should  be  processed  by  the  Debugger  Front  End 
(DFE),  such  as  MonDFE.  However,  MonTIP  has  the  capability  of  also  processing 
commands.  The  range  of  commands  dealt  with  by  MonTIP  is  greatly  limited.  Each 
DFE  has  a  mechanism  by  which  its  command  processing  can  be  placed  in 
transparent-mode.  This  causes  commands  to  be  passed  to  the  TIP.  With  MonDFE, 
commands  beginning  with  the  keyword  “tip”  are  passed  to  MonTIP.  A  number  of
commands  have  been  added  to  MonTIP  to  support  analyzer  operation.  By  typing  the 
MonDFE  command  “tip  lahelp”  a  list  of  the  commands  will  be  displayed.  The 
MonTIP  man-page  describes  the  commands  in  more  detail. 

The MonTIP command “latadd label, width” is used to add a column to the trace
listing produced by MonTIP. A “latadd” command should be used for each column in
the processed trace listing. Acceptable values for “label” are defined by the
labels which appear in the raw trace listing. The only exception to this rule is
for labels SYMADDR and ASMDATA. These are pseudo labels derived from the raw labels
ADDR and DATA respectively. The use of the SYMADDR label causes the hexadecimal
address value to be replaced by an address symbol. For this to work the “lacoff
file” command must be used to specify the file to be used during symbol table
look-up. Addresses which are not found in the COFF file are presented in
hexadecimal format. Use of the ASMDATA label indicates that the DATA label
information should be disassembled when the corresponding address is known to lie
in a TEXT region.
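
The behaviour of the SYMADDR pseudo label can be pictured with a small Python
sketch. The function name, the symbol dictionary and the symbol names are
hypothetical; the sketch only illustrates the stated rule that an address found in
the symbol table is shown as a symbol and any other address is shown in
hexadecimal. Exact-match lookup is an assumption of the sketch.

```python
def symaddr(address, symbols):
    """Format an address the way a SYMADDR-style column might.

    `symbols` maps addresses to symbol names (assumed to have been read
    from the COFF file). Unknown addresses fall back to hexadecimal.
    """
    name = symbols.get(address)
    return name if name is not None else f"{address:08x}"


symbols = {0x10004: "_main"}      # illustrative symbol table
print(symaddr(0x10004, symbols))  # address with a symbol
print(symaddr(0x10008, symbols))  # address without a symbol
```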

The “latd start, end” command is used to display processed trace information
based on stored lines. Processed line and raw line numbers are the same with regard
to the processor status during the traced processor cycle, but only valid lines
appear in the processed trace listing. A valid line is one in which useful
processor activity was performed. For valid lines, the ADDR, DATA, R/_W and I/_D
labels are reevaluated to correspond with the associated processor status value.

MonDFE  Trace  Access  Commands 

This section briefly describes MonDFE commands relating to displaying trace
information. A complete list of MonDFE commands is obtained by entering the command
“?”. MonDFE supports command files with the “zc file” command. It is useful to
place a list of “tip latadd SYMADDR”-type commands in a command file such as la.rc.
This enables the “zc la.rc” command to initialize MonTIP trace processing. The
MonDFE command “ze file” can be used to record displayed information into a log
file.

Before a trigger can occur, trigger conditions must be installed in the analyzer.
The command “latrig term, label=pattern” can be used when setting trigger patterns
in the logic analyzer. Trigger logic and sequence control must be specified using
the analyzer window. Once the trigger has been established (and the POD_040._D
setup may be adequate) it is usually only necessary to use “latrig” commands such
as “tip latrig a, ADDR=10004”. Trigger patterns can be entered directly using the
analyzer window; this also requires that the hexadecimal pattern (rather than
symbol) value be known. If the swaf program is used to build symbol table
information in Hewlett-Packard’s GPA format, it can be directly loaded into the
analyzer. This enables trigger labels to be set using symbols when directly using
the analyzer window.

The  Analyzer  can  be  triggered  at  each  breakpoint  by  setting  the  break-address 
to  the  illegal  op-code  vector  address  (vector  0).  This  technique  is  useful  when 
breakpoints  are  implemented  by  temporarily  replacing  instructions  with  illegal 
op-code  instructions.  The  MiniMON29K  DebugCore  uses  this  technique  when 
on-chip  breakpoint  registers  are  not  available. 

UDB  Commands 

UDB  has  been  enhanced  by  CaseTools  to  support  displaying  trace  information 
in  source  format.  This  makes  UDB  a  preferred  tool  for  use  with  a  logic  analyzer. 
Additionally,  UDB,  like  other  non-enhanced  source  level  debuggers,  can  also  be  used
in  transparent-mode.  With  transparent-mode  operation,  it  is  possible  to  issue 
commands  for  MonTIP  processing.  Given  that  UDB  supports  source  level  tracing,  it 
is  unlikely  that  transparent-mode  operation  would  be  selected  for  use  with  UDB. 


However,  it  is  described  here  to  aid  users  of  other  source  level  debuggers  which  only 
have  access  to  transparent-mode  commands. 

UDB  is  a  window  based  debugger;  however,  command  line  processing  is 
supported.  When  in  the  Main  window  or  the  Console  window,  a  command  line 
sequence  begins  with  an  <ESC>  character.  For  example,  to  issue  a  “latadd”  MonTIP 
command,  use  the  command  sequence  “<ESC>ioctl  tip  latadd  label”. 

Alternatively, the “latadd label” command can be directly entered at the Console
window. If the console is not currently gathering input for an outstanding standard
input request (such as a scanf()), the keyboard input is sent to MonTIP for
processing rather than the application or target operating system. It is useful to
place a list of “ioctl tip latadd SYMADDR”-type commands in a command file such as
la.rc. The “<ESC>ioctl tip exec la.rc” command can then be used to process the
MonTIP command file. An example la.rc file for use with UDB is shown below:


ioctl  tip  latclr; 
ioctl  tip  latadd  LINE; 
ioctl  tip  latadd  ADDR; 
ioctl  tip  latadd  SYMADDR; 
ioctl  tip  latadd  ASMDATA; 
ioctl  tip  latadd  R/_W, 6; 
ioctl  tip  latadd  *STAT_, 6; 
ioctl  tip  lamore  20; 


The  trace  listing  produced  by  commands  such  as  “<ESC>ioctl  latd  0,  20”  will 
appear  in  the  console  window  along  with  any  other  console  output  information  such 
as  printf()  output.  It  is  also  convenient  to  use  UDB’s  macro  instruction  capability  to 
bind  macros  to  buttons  associated  with  the  console  frame.  This  allows  user  defined 
buttons  (left  side  of  frame)  to  be  simply  clicked  to  issue  the  required  MonTIP 
command.  The  macro  instructions  shown  below  can  be  placed  in  the  udb.rc  startup 
file. 


macro m=mcon -f -"echo"  {lb6}  "{com}stty techo\r"
macro m=mcon -f -"trig"  {lb7}  "{com}ioctl tip latrig a, ADDR="
macro m=mcon -f -"sync"  {lb8}  "{com}ioctl tip lasync\r"
macro m=mcon -f -"la.rc" {lb9}  "{com}exec la.rc\r"
macro m=mcon -f -"coff"  {lb10} "{com}ioctl tip lacoff "
macro m=mcon -f -"help"  {lb11} "{com}ioctl tip lahelp\r"
macro m=mcon -f -"latd"  {lb12} "{com}ioctl tip latd "

Unprocessed analyzer trace was shown in Figure 7-14. The corresponding processed
trace is shown in Figure 7-16. Trace information is presented in the console frame
using UDB in a transparent-mode of operation. Although the Console window is
adequate, it is easier to study program execution from the enhanced trace window.

The enhanced trace window is shown in Figure 7-17. This window appears when the
trace view-toolbar button (bottom right of window) is selected. The trace listing
window is formatted via the “trcol” command. This command, along with a number of
other trace modifying commands, can be interactively entered at the UDB command
line. However, it is much more typical to arrange for “trcol” commands to be
processed during UDB start-up. This is accomplished by entering a “trcol” command
sequence, such as the following example, into the udb.rc file.


trcol -d -w 8  ADDR
trcol -d -w 15 SYM
trcol -d -w 8  DATA
trcol -d -w 27 DASM
trcol -d -w 6  TYPE
trcol -d -w 6  STAT

UDB fetches new trace information when the Fetch button is pressed. This button
should be used whenever the analyzer has acquired new trace data. The Start and
Stop buttons are provided for remote Running and Halting of the analyzer data
acquisition. This is equivalent to using the top-right-hand corner button on the
analyzer display. The Line button is used for displaying a desired line number,
entered via a dialog box. The Top button moves the current line (indicated by the
cursor position) to the top of the display.

The  Set  button  is  particularly  useful.  The  selected  line  is  highlighted  in  red  and 
the  raw  analyzer  display  is  adjusted  as  necessary  to  show  the  corresponding  raw  trace 
line.  If  any  source-display  windows  are  open  (in  update  mode  rather  than  edit
mode),  they  are  adjusted  to  show  and  highlight  (in  red)  the  corresponding  C  source 
line.  The  Loc  button  can  be  used  to  relocate  the  current  highlighted  line. 

After  a  trace  line  has  been  selected  and  the  Set  button  applied,  the  Next  and  Prev 
buttons  can  be  used  to  single  step  through  source  level  trace.  Using  the  Next  button, 
all  three  displays  (if  in  use)  will  be  updated  with  the  next  line  corresponding  to  the 
recorded  source  execution.  If  the  shift  key  is  held  down  while  using  the  Next  or  Prev 
button,  assembly  level  stepping  rather  than  source  level  stepping  is  performed. 

Trace column SYM is a synonym for column ADDR formatted symbolically. The address
symbols are taken from the loaded symbol file, hence the need to load a UDB symbol
file produced by the mksym utility. The DASM column is a synonym for DATA presented
in disassembly format. When an instruction is supplied by the on-chip cache,
XXXXXXXX is placed in the DATA column. However, the Xs will be replaced with the
actual instruction if an executable program has been loaded. For more information
on preparing the trace display see section 7.9.3.

Figure 7-16. UDB Console Window Showing Processed Trace Information

7.9 Fusion3D TOOLS

Shortened development times and increased product complexity have necessitated
the use of powerful software development tools. Unfortunately, the higher processor
speeds and on-chip integration provided by many of the newer embedded RISC
processors have led to an increased cost associated with traditional debug tools
such as In-Circuit Emulators (ICE). Additionally, the rapid changes occurring in
the embedded processor market and the frequent introduction of processor variations
have placed emphasis on the need for tool reusability. Reusable and low cost tools
have a broad appeal among software designers.

The  term  Fusion3D  refers  to  a  Distributed  Design  and  Debug  environment.  The 
purpose  behind  Fusion3D  is  to  provide  cost  effective  design  and  debug  tool 
alternatives  selectable  from  a  range  of  compatible  products.  This  is  achieved  by 
distributing  the  primary  tool  functions.  For  example,  traditionally  a  full-function 
ICE  has  been  chosen  as  the  primary  debug  tool.  However,  the  overlay  or  substitute 
target  memory  provided  by  an  ICE  is  alternatively  available  with  a  ROM  emulator; 
the  ICE’s  tracing  capability  can  be  effectively  achieved  with  a  logic  analyzer;  and 
controlling  program  execution  can  be  accomplished  with  a  debug  monitor  and 
on-chip  debug  support  hardware.  The  capabilities  inherent  in  a  full-function  ICE  are 
distributed  among  the  selected  Fusion3D  components.  AMD  has  identified  and 
worked  with  key  Fusion3D  partners  to  bring  together  the  necessary  components  of 
the  Fusion3D  environment. 

The  Fusion3D  approach  is  flexible.  The  scalable  nature  of  Fusion3D  enables  the 
software  developer  to  construct  a  debug  environment  which  is  adequate  for  the  task 
to  be  undertaken,  yet  does  not  incur  the  high  costs  typically  associated  with  a 
full-function  ICE.  At  a  later  stage,  if  a  project  requires  an  additional  debug 
capability,  the  chosen  tool  combination  can  be  enhanced. 

Many of the tools provided by the Fusion3D program are usable with any member of
the 29K family or other processor family such as the X86. This helps reduce the
cost associated with tooling-up for a new project. For example, the HP16500B logic
analyzer is widely used within the industry. Traditionally it has been used by
hardware development engineers. Extending its utility as a software development
tool, usable across a wide range of processors, is very cost effective.

7.9.1  NetROM  ROM  Emulator 

A ROM emulator is used to replace a system’s ROM or SRAM type memory devices with
substitute memory. Typically, ROM devices are removed from socket locations on the
target system and a cable used to connect the ROM emulator to the vacated sockets.
The processor can read the emulated memory as if it were real ROM. Occasionally
there may be differences in memory access times due to different memory access wait
states; but essentially the system runs as normal.

ROM emulators always provide a second access port to the emulated memory. Via
this second port, the contents of the memory can be read or written. This is
generally accomplished by a host computer to which the ROM emulator is attached.
The technique enables programs to be installed in system memory without the need to
prepare (often termed burn) new ROM devices. During the process of developing and
debugging software, modification of the program code frequently occurs.
Consequently, an updated program must be reinstalled in the target system’s memory.
The process of preparing new ROMs is slow, and a ROM emulator with a fast computer
link provides an alternative means of updating the system memory.

NetROM  is  a  ROM  emulator  product  provided  by  a  Fusion29K  tool 
development  partner.  It  can  emulate  8-bit,  16-bit  or  32-bit  wide  memory  devices  as 
required.  Depending  on  the  width  (number  of  bits)  of  the  memory  being  emulated, 
between  one  and  four  cables  are  required  to  connect  the  NetROM  to  the  target  system 
memory.  Up  to  1M  byte  of  memory  can  be  emulated,  depending  on  the  pin  layout  of 
the  memory  devices  in  use.  The  1M  byte  limitation  does  not  restrict  NetROM’s  use 
for  developing  programs  which  are  larger  than  1M  byte  —  this  is  achieved  via  the 
on-board  UART.  The  UART  is  mapped  into  a  location  within  the  emulated  memory 
space.  The  29K  processor  can  exchange  data  with  the  UART.  The  host  computer  can 
also  access  the  UART  and  hence  exchange  information  with  the  29K  processor. 

The MiniMON29K bundle contains a driver for the NetROM UART (often referred to as
a virtual UART). This enables the TIP program (MonTIP) running on the host computer
to communicate with the DebugCore software running on the 29K target system. The
method enables programs to be downloaded into the target system’s DRAM memory.
Typically, OS-boot, the DebugCore and support drivers are placed in emulation
memory; then, via MonTIP support, programs are loaded and executed out of the
target system’s DRAM memory.

The NetROM equipment connects to the host computer via an Ethernet connection. A
NetROM can be used with an IBM PC compatible machine running Windows; however,
because of its Ethernet connection, it is much more frequently used with networked
Unix based systems. The Unix machine serving the NetROM will have an entry in
(typically) its /etc/bootptab file, specifying the IP and Ethernet addresses
allocated to the NetROM. Also specified in the bootptab file is the path to the
NetROM configuration file. An example bootptab file entry for a CMU type server is
shown below. The actual NetROM configuration file is /tftpboot/netrom/startup2.bat.
Note that for servers running in “secure” mode, the /tftpboot directory must be at
the root of the path to the NetROM configuration file. The NetROM (client) Ethernet
hardware address is given by the “:ha=” field.

netrom2:hd=/tftpboot/netrom:bf=startup2.bat:sm=255.255.255.255:
        ht=1:ha=00402f008444:ip=163.181.22.60
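
To make the structure of the entry explicit, the following Python sketch splits
it into its tag fields. It is an illustration only: the function name is
hypothetical, and the parsing is deliberately naive (it assumes that no value
contains a colon, which holds for the entry shown above).

```python
def parse_bootptab_entry(entry):
    """Split a bootptab-style entry into (host name, tag dictionary).

    Fields are separated by colons; each field after the host name has
    the form tag=value. Whitespace and line breaks around fields are
    ignored, and empty fields are skipped.
    """
    fields = [field.strip() for field in entry.split(":")]
    name = fields[0]
    tags = {}
    for field in fields[1:]:
        if not field:
            continue
        tag, _, value = field.partition("=")
        tags[tag] = value
    return name, tags


ENTRY = ("netrom2:hd=/tftpboot/netrom:bf=startup2.bat:sm=255.255.255.255:\n"
         "        ht=1:ha=00402f008444:ip=163.181.22.60")
name, tags = parse_bootptab_entry(ENTRY)
print(name, tags["ip"], tags["hd"] + "/" + tags["bf"])
```

Joining the hd and bf fields gives /tftpboot/netrom/startup2.bat, the
configuration file named in the text.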

When the host server connects to the NetROM client, the configuration file is
downloaded into the NetROM. A portion of the /tftpboot/netrom/startup2.bat file is
shown below. The loadfile and loadpath variables are used to specify the default
image file to load into emulation memory. In the example below, the default image
file is /tftpboot/netrom/target/sa29040.hex. The key parameters in the
configuration file should be arranged to describe the type of memory being
emulated. The example below shows four 27c020 memory devices combined to produce a
32-bit memory system. This will require four NetROM cables. It is possible (and
common) to emulate only an 8-bit wide memory system.


; part of the startup2.bat NetROM configuration file

setenv host      163.181.22.9     ; server IP address
setenv loadfile  sa29040.hex      ; 29K program (image)
setenv loadpath  /netrom/target   ; path to 29k program
setenv romtype   27c020           ; ROM type
setenv romcount  4                ; number of ROMs
setenv podorder  0:1:2:3          ; pod order
setenv wordsize  32               ; memory width
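
The relationship between the romtype, romcount and wordsize settings can be
checked with a little arithmetic, sketched here in Python. The function and table
names are hypothetical. The sketch assumes that the 27c020 is a 2-Mbit device
organised as 256K x 8, and that every emulated device is 8 bits wide.

```python
ROM_BYTES = {"27c020": 256 * 1024}  # 2-Mbit EPROM, 256K x 8


def emulated_memory(romtype, romcount, wordsize):
    """Return (total bytes, total words) for an emulated ROM bank.

    Each device supplies 8 bits of the word, so romcount must be a
    multiple of wordsize / 8.
    """
    devices_per_word = wordsize // 8
    if romcount % devices_per_word != 0:
        raise ValueError("romcount does not fill whole words")
    total_bytes = ROM_BYTES[romtype] * romcount
    return total_bytes, total_bytes // devices_per_word


total_bytes, total_words = emulated_memory("27c020", 4, 32)
print(total_bytes, total_words)
```

With the settings shown, the four devices provide 1M byte arranged as 256K
32-bit words, which matches the 1M byte emulation limit mentioned earlier.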

A  NetROM  can  support  TELNET  and  direct  TCP  connections  simultaneously. 
The  MonTIP  program  forms  a  direct  connection  to  the  NetROM  via  the  parameter 
information  located  in  the  udi_soc  file  (see  section  7.5.6).  An  example  udi_soc  entry 
is  shown  below. 

#  udi_soc  file  entry  to  support  NetROM 

netrom2  AF_UNIX  soc  montip  -t  netrom  -netaddr  163.181.22.60  -netport  1234 

It  is  possible  to  have  a  TELNET  session  active  with  a  NetROM  while  also 
running  MonTIP.  Of  course,  the  user  controls  the  NetROM  via  a  front-end  debug 
tool  such  as  UDB  which  directs  the  operation  of  MonTIP  via  the  UDI  interface.  From 
a window running the TELNET command “telnet netrom2” (for example), the
“newimage”  NetROM  command  can  be  used  to  download  a  file  (usually  the  default) 
into  emulation  memory.  The  sa29040.hex  image  file  contains  OS-boot,  the 
DebugCore,  and  support  driver  code  for  an  SA29040  evaluation  board.  Once 
installed  it  enables  DebugCore  messages  to  be  exchanged  between  the  29K  target 
and  the  host  computer  running  MonTIP. 

A  software  reset  can  be  performed  by  issuing  a  reset  command  from  UDB. 
Normally  the  DebugCore  is  successfully  running  and  will  perform  the  reset.  Under 
extreme  conditions  the  DebugCore  may  no  longer  be  in  control  of  the  29K  processor. 
In  this  case  a  hardware  reset  can  be  performed.  This  requires  that  the  29K  reset  pin  be 
asserted. From the TELNET session this is accomplished via the “tgtreset”
command.  The  technique  requires  that  a  reset  wire  be  used  to  connect  the  reset  output 
pin  on  the  back  of  the  NetROM  (marked  R)  to  a  connection  post  on  the  target  system. 
The  connection  post  must  be  wired  to  the  processor  reset  pin.  For  this  reason,  it  is  best 
to  incorporate  a  reset  connection  post  on  each  29K  target  system  for  use  by  the 
NetROM. 

Once  a  NetROM  has  been  added  to  a  network,  a  TELNET  connection  can  be 
used  to  confirm  its  correct  installation.  After  issuing  a  “newimage”  command,  and 
possibly a “tgtreset” command, the 29K target system is ready for operation. The
chosen debug tool (UDB, GDB, etc.) can then be invoked and used to examine,
modify and control the target 29K processor in the normal way. Once correct
installation has been confirmed, there is no need to first establish a TELNET
connection  before  initiating  normal  program  debug.  All  that  is  necessary  is  to  start 
execution  of  the  chosen  debugger. 

The NetROM driver (for the 29K side of the virtual UART) that is built into the
image file typically operates in poll-mode. In this mode the 29K processor
occasionally polls the UART to determine if it is receiving a message from MonTIP.
The image file can instead be built with an interrupt-mode driver. This enables MonTIP to
interrupt the 29K at any time (if interrupts are enabled) when it wishes to send a
message (such as halt) to the DebugCore. To enable operation of this technique, an
interrupt wire must be used to connect the interrupt output pin on the back of the
NetROM to an interrupt input post on the 29K system. Once again, the post should be
incorporated into any design which wishes to make use of a NetROM.

7.9.2  HP16500B  Logic  Analyzer 

Network  Installation 

The  use  of  logic  analyzers  for  tracing  program  execution  was  previously 
presented  in  section  7.8.  This  section  briefly  deals  with  the  details  of  configuring  the 
logic  analyzer’s  operation  for  use  in  the  Fusion3D  environment.  A  high  speed 
connection  to  the  analyzer  is  achieved  via  the  optional  Ethernet  link.  This  requires 
that  the  analyzer  be  allocated  a  unique  IP  address.  Using  the  analyzer’s 
communications  set-up  window,  the  IP  address  is  recorded  for  future  use.  With  Unix 
networks,  the  IP  address  and  chosen  name  are  entered  into  the  network  database  file 
/etc/hosts.  The  following  example  allocates  an  IP  address  for  an  analyzer  called 
“hpla”. 


# entries in /etc/hosts file allocating IP addresses
163.181.22.117  hpla    # logic analyzer
163.181.22.121  ginger  # X-terminal

The  analyzer  connection  can  be  confirmed  by  establishing  a  telnet  connection. 
This  is  accomplished  with  a  “telnet  hpla  5025”  command.  Port  number  5025  enables 
access  to  the  analyzer  command  parser.  Commands  can  then  be  directly  issued  to  the 
analyzer. One very useful command, “xwin on,'163.181.22.121:0.0'”, establishes a
remote  window  interface  to  the  analyzer.  The  example  command  shown  causes  an 
analyzer  front  panel  interface  to  be  presented  on  the  display  determined  by  the  IP 
address  163.181.22.121.  Checking  the  example  /etc/hosts  file,  it  appears  to  be  an 
X-terminal  known  to  the  network  by  the  name  ginger.  It  is  important  that  the  X  server 
allow  the  analyzer  to  make  connection  to  the  server.  The  “xhost  +  hpla”  command 
can  be  used  to  add  hpla  to  the  list  of  machines  that  are  allowed  to  make  connection  to 
the  X  server.  To  obtain  the  name  of  your  terminal’s  display,  print  the  environment
variable  DISPLAY  as  shown  below. 


echo $DISPLAY    #Unix shell command
ginger:0.0       #response

It  is  important  for  the  successful  operation  of  MonTIP  that  the  environment 
variable DISPLAY be correctly initialized. Note that some HP workstations set the
variable to the value “local:0.0”; this does not create any difficulty for MonTIP.

UDI  Installation 

The  udi_soc  file  (for  Unix  based  systems)  must  contain  an  entry  for 
establishing,  via  UDI,  the  MonTIP  to  analyzer  connection.  The  MonTIP  option  “-la 
name”  is  provided  for  this  purpose.  The  example  below  shows  a  udi_soc  entry  for  a 
session  identified  by  the  name  “trace”.  Note  that  the  udi_soc  file  format  was 
described  in  detail  in  section  7.5.6.  If  a  logic  analyzer  were  being  used  alone,  the 
example  udi_soc  entry  would  be  adequate.  However,  a  NetROM  is  typically 
combined  with  an  analyzer.  In  this  case  the  two  entries  shown  below  would  be 
combined  to  produce  a  single  entry  with  a  unique  session  identifier. 

# udi_soc file entry to support logic analyzer
trace AF_UNIX soc montip -la hpla
rom   AF_UNIX soc montip -t netrom -netaddr 163.181.22.60 -netport 1234
# the first field of each entry (trace, rom) is the UDI session_id
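
A combined entry can be derived mechanically from the two entries above. The Python sketch below is only an illustration: the session name tracerom is hypothetical, and it is an assumption that MonTIP accepts the two option lists simply concatenated in this order.

```python
def combine_udi_soc(session_id, *entries):
    """Merge the option lists of several udi_soc entries under one session id.

    Each entry has the form: <session_id> <family> <socket> <executable> <options...>
    """
    prefix = None
    options = []
    for entry in entries:
        fields = entry.split()
        head, opts = fields[1:4], fields[4:]
        if prefix is None:
            prefix = head
        elif head != prefix:
            raise ValueError("entries must name the same socket and executable")
        options.extend(opts)
    return " ".join([session_id, *prefix, *options])


combined = combine_udi_soc(
    "tracerom",  # hypothetical session identifier
    "trace AF_UNIX soc montip -la hpla",
    "rom AF_UNIX soc montip -t netrom -netaddr 163.181.22.60 -netport 1234",
)
print(combined)
```

The resulting line keeps the socket and executable fields unchanged and carries both the “-la” analyzer option and the NetROM options, under a session identifier that must not clash with any other entry in the file.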

When  using  the  UDB  source  level  debugger  to  control  a  logic  analyzer,  a 
mktarget  command  must  be  placed  in  the  udb.rc  start-up  command  file.  As 
explained  in  section  7.7,  a  GIO  process,  controlled  by  UDB,  uses  the  assigned 
mktarget  parameters  to  connect  to  a  29K  target  (in  this  case  via  MonTIP).  An 
example  udb.rc  entry  is  shown  below. 


# udb.rc, UDB startup command file
# driver args (GIO ID, GIO executable, exec. flags, udi_soc session ID)
# mktarget name id type driver (args....)
mktarget LA 1 29040 dr_gio 0 ios_udi -be trace
# in this example ios_udi is the GIO executable and trace is the UDI session_id

Note that normally the GIO and UDB processes determine the endianness of the 29K
target by examining the processor’s CFG special register. When an analyzer is used
alone, there is no connection to the 29K processor and the CFG register cannot be
accessed. This necessitates that the mktarget command specify the target endianness.
The  “-be”  switch  is  used  in  the  example  to  select  big-endian  operation.  The  “-le” 
switch  is  available  for  selecting  little-endian.  The  following  section  7.9.3  describes 
how  user  defined  buttons  can  be  used  to  issue  mktarget  commands. 

Accessing  the  Analyzer  File  System 

It  is  very  convenient  to  be  able  to  drive  the  logic  analyzer  remotely  from,  say,  the 
X-terminal  on  your  desk.  Note  that  a  colour  monitor  is  required  to  achieve  full 
control of the analyzer. As described above, remote control of the analyzer is enabled
via the “xwin on” command. When remote control of the analyzer is no longer
required, the command “xwin off”, entered via the telnet connection to the analyzer,
discontinues  the  remote  display.  Connection  to  the  analyzer  command  parser  is 
broken  when  the  TELNET  session  is  terminated. 

Only  one  user  can  be  in  control  of  the  analyzer  at  any  time.  This  means  the 
analyzer  can  not  be  driven  from  the  front  panel  when  a  remote  window  is  active. 
When  MonTIP  controls  the  analyzer,  it  requests  a  remote  window  be  presented  on  the 
MonTIP  host  computer  (actually,  the  DISPLAY  variable  identifies  the  screen). 
Consequently,  it  is  not  possible  for  another  user  to  establish  a  second  remote  window 
connection.  However,  it  is  possible  to  simultaneously  have  an  FTP  connection  active 
when  remotely  controlling  the  analyzer.  The  example  command  sequence  below 
demonstrates  how  this  is  achieved. 

1% ftp hpla                            #Unix shell command
Connected to hpla.
220 HP16500B V01.00 FUSION FTP server (Version 3.3) ready.
Name (hpla:danm): data
230 User DATA logged in.
ftp> cd system/disk/hard/amd/danm
200 Remote Directory changed to "/system/disk/hard/amd/danm".
ftp>

When  entering  a  login  name,  the  identifier  “data”  was  used  in  the  above 
example.  This  enables  read  access  to  files  located  on  the  analyzer  disk  system. 
Entering  the  identifier  “control”  enables  read-write  access  to  the  file  system. 
However,  logging  in  as  “control”  is  not  permitted  if  another  user  is  identified  as 
already  controlling  the  analyzer.  Files  can  be  transferred  from/to  the  analyzer  using 
the  FTP  commands  get/put;  remember  you  may  have  to  first  use  the  binary  command 
to  enable  transfer  of  binary  data  files. 
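
The same transfer can be scripted. The following is a minimal sketch using Python's standard ftplib module; the file name trace.dat and the wrapper function are hypothetical, and it is assumed that the analyzer's FTP server accepts the “data” login without a password, as the interactive session above suggests.

```python
from ftplib import FTP


def fetch_binary(ftp, remote_dir, name, local_path):
    """Change to remote_dir and copy one file in binary mode."""
    ftp.cwd(remote_dir)
    with open(local_path, "wb") as out:
        # retrbinary selects binary (image) transfer mode before the RETR.
        ftp.retrbinary("RETR " + name, out.write)


def fetch_from_analyzer(host="hpla", name="trace.dat"):
    """Hypothetical wrapper: log in read-only as "data" and fetch one file."""
    with FTP(host) as ftp:
        ftp.login("data")
        fetch_binary(ftp, "system/disk/hard/amd/danm", name, name)
```

Because retrbinary always transfers in binary mode, there is no separate step corresponding to the interactive binary command. Writing files back to the analyzer would require the “control” login and is not shown.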

Triggering  the  Analyzer  for  Trace  Capture 

The  HP16500B  logic  analyzer  is  equipped  with  a  very  sophisticated  triggering 
capability.  Hence,  debuggers  controlling  the  logic  analyzer  tend  to  rely  on  the 
analyzer’s  triggering  logic.  When  using  the  PI-Am29040  preprocessor,  the 
POD_040  configuration  file  prepares  the  analyzer  for  triggering  on  access  to  the 
memory  location  described  by  analyzer  trigger  term  A.  This  may,  or  may  not,  be 
adequate  for  your  triggering  requirements.  All  changes  to  trigger  logic  must  be 
entered  using  the  logic  analyzer  front  panel  display  (remotely  if  desired).  If  using  the 
POD_040  file,  all  that  is  necessary  is  to  supply  the  trigger  address  in  the  ADDR  field 
of  term  A.  Of  course  the  address  must  be  entered  in  hexadecimal  format  unless  a 
symbol  file  has  been  loaded  into  the  analyzer. 

When  on-chip  caches  are  used,  instruction  and  data  accesses  may  not  always 
appear on the processor bus. This complicates the task of triggering the analyzer. It is
also not possible to simply use the ADDR field of term A when a microcontroller is
being used, because the full 32-bit address value is not observable even if an
off-chip memory access is performed. (Microcontrollers divide the address space
into regions; only the lower address bits for any particular region may appear on the
microcontroller address bus.) Dealing with these problems can require the user to be
creative when developing alternative triggering logic.

With  processors  which  have  on-chip  breakpoint  control  registers,  a  SYNC 
pulse  can  be  generated  when  a  specified  data  or  instruction  access  occurs.  The 
analyzer  trigger  logic  can  be  configured  to  trigger  on  the  occurrence  of  the  SYNC 
pulse.  Alternatively,  for  processors  without  breakpoint  control  registers,  a  simple 
arrangement  can  be  used  to  trigger  the  analyzer  when  any  execution  breakpoint  is 
taken:  When  a  breakpoint  is  taken,  the  illegal  opcode  trap  is  taken  (trap  number  zero). 
The  analyzer  should  be  set  to  trigger  on  a  read  of  the  first  entry  in  the  vector  table.  The 
address  is  specified  by  the  contents  of  special  register  VAB  (Vector  Area  Base). 
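
The trigger address for this arrangement can be worked out as in the following Python sketch. It assumes the vector table holds one 32-bit entry per trap number, so that entry n lies 4n bytes above the VAB value; the VAB value used is purely illustrative.

```python
VECTOR_ENTRY_BYTES = 4  # assumption: one 32-bit vector per trap number


def vector_entry_address(vab, trap_number):
    """Address read from the vector table when trap_number is taken."""
    if not 0 <= trap_number <= 255:
        raise ValueError("trap number must be in 0..255")
    return (vab + VECTOR_ENTRY_BYTES * trap_number) & 0xFFFFFFFF


ILLEGAL_OPCODE_TRAP = 0     # trap number zero, taken on a breakpoint
EXAMPLE_VAB = 0x00010000    # hypothetical VAB contents

print(format(vector_entry_address(EXAMPLE_VAB, ILLEGAL_OPCODE_TRAP), "08x"))
```

For trap number zero the result is simply the VAB value itself, which is why the analyzer can be set to trigger on a read of the first vector table entry.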

For  convenience,  UDB  provides  a  remote  method  of  entering  data  into  the  logic 
analyzer  trigger  setup.  Using  the  “trigterm”  command  shown  below,  trigger  patterns 
can be specified for different terms and labels.

trigterm <term> <label> {<pattern> | <address>}

Normally,  one  simply  specifies  a  <pattern>  for  a  label.  The  format  of  the  pattern 
is  assumed  to  be  hexadecimal  unless  a  base  is  explicitly  specified.  However,  in  the 
case  where  the  <label>  is  ADDR,  then  an  <address>  should  be  provided  instead;  and 
UDB  will  convert  the  address,  which  may  be  specified  as  a  symbol,  into  a 
hexadecimal  string  of  eight  characters. 
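
The conversion described above can be pictured with a short Python sketch. This is not UDB's implementation; the symbol table is a hypothetical dictionary, and the use of lower-case hexadecimal digits is an assumption.

```python
def trigger_address_string(address, symbols=None):
    """Return an address as a hexadecimal string of eight characters.

    address may be an integer, a symbol name found in symbols, or a
    string of hexadecimal digits.
    """
    symbols = symbols or {}
    if isinstance(address, str):
        value = symbols[address] if address in symbols else int(address, 16)
    else:
        value = address
    return format(value & 0xFFFFFFFF, "08x")


example_symbols = {"main": 0x00010004}   # hypothetical symbol file contents
print(trigger_address_string("main", example_symbols))
print(trigger_address_string("1f00"))
```

The string is always padded to eight digits, matching the full 32-bit width of the ADDR label.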

To  further  simplify  issuing  “trigterm”  commands,  a  Trig  button  has  been  added 
to  the  View,  Var  and  Dasm  frames.  In  the  View  frame,  clicking  on  a  line  and  then 
clicking  the  Trig  button  will  set  term  A  of  the  ADDR  column  to  the  address  of  the 
source  line  the  cursor  is  currently  on.  Clicking  on  a  variable  and  then  shift-clicking 
the  Trig  button  will  set  term  A  of  the  ADDR  column  to  the  address  of  the  variable  the 
cursor  is  currently  on.  This  only  works  in  the  case  where  the  variable  is  allocated  to  a 
memory  location  and  not  held  in  an  on-chip  register. 

In  the  Dasm  frame,  clicking  on  a  line  and  then  clicking  the  Trig  button  will  set 
term  A  of  the  ADDR  column  to  the  address  of  the  disassembly  line  the  cursor  is 
currently  on. 

Searching  Through  Trace  Data 

The  HP16500B  logic  analyzer  provides,  via  the  front  panel  display,  a  means  of 
searching  for  patterns  in  the  captured  trace  data.  However,  without  symbolic  address 
support,  and  given  the  fact  that  raw  trace  data  is  not  limited  to  just  the  execution 
stream,  it  is  often  more  convenient  to  search  for  patterns  in  the  processed  trace  data. 
UDB  provides  support  for  trace  searching  with  the  “trsearchnext”  and 
“trsearchprev”  commands.  The  command  format  is  shown  below: 
trsearchnext  [<label>  {<pattern>  |  <address>}]
trsearchprev  [<label>  {<pattern>  |  <address>}] 

Normally, one specifies a search <pattern> for a selected <label>. However, in
the case where the <label> is ADDR, then an <address> should be provided instead.
In such a case, UDB will convert the address, which may be specified as a symbol, into
a  hexadecimal  string  of  eight  characters. 

After  a  <label>  and  <pattern>  have  been  specified,  UDB  remembers  them  to 
allow  for  further  searching  without  having  to  specify  the  <label>  and  <pattern> 
again.  In  particular,  the  Next  and  Prev  keys  in  the  trace  frame  have  been  overloaded, 
such  that  Ctrl-Shift-Clicking  them  will  issue  these  commands  with  no  parameters. 

UDB  supports  the  binding  of  buttons  to  macro  commands.  This  is  a  convenient 
means  of  issuing  “trsearchnext”  commands.  The  following  udb.rc  command 
sequence  assigns  buttons  to  the  macro  table  associated  with  the  Trace  frame.  Note 
that  user  programmed  buttons  should  be  restricted  to  the  left  hand  side  of  the  window. 
The example commands create two buttons. The next button can be used to initiate a
“trsearchnext” command. Because the command-string does not finish with a “\r”
character, the user can enter the <pattern> from the command line interface.

# macro table      button  position  command
# -----------      ------  --------  -------
macro m=mtrace -f  "next"  {lb1}     "{com} trsearchnext ADDR "
macro m=mtrace -f  "prev"  {lb2}     "{com} trsearchprev ADDR "

7.9.3  Selecting  Trace  Signals 

In section 7.8 under the headings MonTIP Commands and UDB Commands,
techniques for formatting the trace display were presented. Processor signals,
such as the address bus, are grouped together and assigned labels. The user can
always  rely  on  the  following  four  labels  being  available  for  display:  ADDR,  DATA, 
LINE,  TYPE.  Different  29K  family  members,  and  different  system  configurations, 
will  provide  a  number  of  other  useful  labels,  such  as  R/_W.  Given  the  limited  size  of 
the  trace  display,  it  is  necessary  to  limit  the  number  of  trace  labels. 

Depending on the source level debugger selected, or if MonTIP is being used to
format the trace display, there may be synonyms for the main trace labels. For example,
ADDR is also known as SYM by UDB, and as SYMADDR by MonTIP. These
alternatives to ADDR enable the traced address values to be presented symbolically,
even if the logic analyzer is configured to display them in, say, hexadecimal. An alternative
implementation would have been to support a format parameter for controlling
the display of selected labels. So far, however, this has not been the route taken by
source level debugger implementors. Note that if the SYM label is selected, it is necessary
to load a UDB symbol file. This file is produced by the mksym utility.

The  DASM  column  is  a  synonym  for  DATA  presented  in  disassembly  format. 
The  trace  processing  algorithms  used  with  the  Am29040  processor  place 
XXXXXXXX  in  the  DATA  column  when  an  instruction  is  supplied  by  the  on-chip
cache.  However,  the  Xs  will  be  replaced  with  the  actual  instruction  if  an  executable 
program  has  been  loaded.  For  replacement  to  be  successful,  the  Am29040  target 
processor  must  be  executing  with  physical  addressing  or  with  a  one-to-one  virtual  to 
physical  address  translation  scheme.  This  is  because  the  Am29040  slave  processor 
produces  physical  address  values.  The  virtual  addresses  inherent  in  the  loaded 
program  must  correspond  to  the  physical  addresses  appearing  on  the  processor 
address  bus.  The  Am29460  slave  processor  produces  virtual  addresses,  but  this  does 
not  entirely  solve  the  problems  created  by  the  use  of  address  translation.  I  will  say 
more  about  this  in  section  7.9.5. 

As  explained  in  section  7.8,  the  trace  listing  frame  is  formatted  with  the  “trcol” 
command.  This  is  accomplished  by  entering  a  “trcol”  command  sequence,  such  as 
the  following  example,  into  the  start-up  udb.rc  file.  The  “-w”  parameter  specifies  the 
maximum  display  width  (in  characters)  for  a  label.  The  udb.rc  file  is  accessed  from 
the  current  working  directory  or  from  your  home  directory. 


trcol  -d  -w  8   ADDR   #general trace labels
trcol  -d  -w  15  SYM    #symbolic address
trcol  -d  -w  8   DATA   #data bus value
trcol  -d  -w  27  DASM   #disassembled DATA label
trcol  -d  -w  6   TYPE   #type of operation
trcol  -d  -w  6   STAT_  #additional Am29040 trace

Displaying  a  large  number  of  labels  will  require  a  wide  trace  frame.  It  is  useful  to 
initially  define  a  large  View  frame,  which  can  later  be  switched  to  displaying  trace. 
When UDB is invoked, a fixed-size null frame is randomly positioned on the display.
Using the following udb.rc command sequence, the null frame can be replaced
with  a  user  specified  window  size  positioned  at  the  top  left  hand  corner  of  the  screen. 
After  creating  the  View  frame,  the  null  window  is  deleted. 

#create  X  Y      rows  columns  type   udb.rc command
wcreate  0  0  -g  24    90       view   #create new View frame
wdelete  0                               #delete original null window

UDB  supports  the  binding  of  buttons  to  macro  commands.  This  is  a  convenient 
means  of  issuing  mktarget  commands,  rather  than  hard-wiring  them  into  udb.rc. 
This  simplifies  the  task  of  selecting  from  a  number  of  different  mktarget  options. 
Given  that  the  View  frame  is  established  during  UDB  start-up  (as  described  above), 
the  following  udb.rc  commands  assign  buttons  to  the  macro  table  associated  with  the 
View  frame.  Note  that  user  programmed  buttons  should  be  restricted  to  the  left  hand 
side  of  the  window.  The  example  command  sequence  creates  two  buttons.  The  button 
marked  “LA”  can  be  used  to  establish  a  connection  to  a  logic  analyzer.  The  “view  -u” 
command  causes  a  source  frame  to  be  invoked.  If  the  29K  program  counter  is 
currently  not  in  source  but  in  disassembly,  the  disassembly  view  will  be  invoked. 

# macro table     button  position  command
# -----------     ------  --------  -------
macro m=mview -f  "iss"   {lb1}     "mktarget ISS 1 29040 dr_gio 0 ios_udi simulator; view -u\r"
macro m=mview -f  "LA"    {lb2}     "mktarget LA 1 29040 dr_gio 0 ios_udi -be trace; view -u\r"


7.9.4 Corelis PI-Am29040 Preprocessor

A  logic  analyzer  preprocessor  simplifies  the  connection  of  the  analyzer  to  the 
target  system.  The  principles  behind  its  operation  were  discussed  in  section  7.8.  This 
section briefly deals with the operating details encountered with the Am29040
preprocessor. To prepare the preprocessor based system, a number of steps must be taken:

1.  The  PI-Am29040  Preprocessor  hardware  unit  replaces  the  Am29040 
processor  in  the  target  system.  The  preprocessor  contains  two  Am29040 
processors, one operating in master mode, the other in slave mode. Earlier
versions of the preprocessor required that certain pins such as MEMCLK
(H-14)  on  the  slave  processor  be  removed.  Later  versions  do  not  require  pin 
removal.  There  is  a  jumper  option  for  removing  the  slave  MEMCLK  signal  if 
it  is  configured  as  an  output.  If  MEMCLK  is  configured  as  an  input,  the  slave 
and  master  MEMCLKs  must  be  tied  together.  Because  of  the  high  speed 
operation  of  Am29040  based  systems,  the  use  of  PGA  socket  extenders 
should  be  limited  as  they  add  to  signal  propagation  delays.  It  is  often 
desirable  to  add  extenders  to  the  preprocessor  connection  pins  to  protect  them 
from  damage.  If  a  pin  gets  broken,  it  is  less  expensive  to  replace  a  socket  than 
to  replace  the  preprocessor.  Zero  ohm  resistors  have  been  incorporated  in 
series  with  a  number  of  signal  pins,  such  as  MEMCLK  and  INCLK. 
Impedance  matching,  and  hence  better  signal  conditioning,  can  be  achieved 
by  replacing  these  resistors  with  an  appropriate  value  resistor. 

2.  If  HP16550A  logic  analyzer  cards  are  being  used  with  the  HP16500B  system, 
then  two  cards  should  be  wired  together  in  accordance  with  the  HP 
Installation  Manual.  Two  analyzer  cards  provide  a  total  of  12  trace  pods. 
Assuming the cards are located in slots D and E, pod E1 (slot E - master)
should  be  connected  to  position  J1  on  the  preprocessor.  Pod  E2  to  position  J2, 
and  so  on.  Pods  D1-D3  should  be  connected  to  J7-J9  (see  Table  7-5).  The 
analyzer  configuration  file  POD_040._D  will  format  the  analyzer  cards  for 
this configuration. (The ._D file name postfix is used because the master analyzer
card is located in card cage slot D.) The POD_040 configuration file is
available  from  AMD  or  Corelis.  It  is  important  to  obtain  a  copy  of  the 
configuration  file,  as  it  is  much  too  time  consuming  to  reassign  the 
pod-to-label  mapping  by  hand. 
Table 7-5. PI-Am29040 Logic Analyzer Pod Assignment

PI-Am29040               HP16550A       HP16555A
Preprocessor connector   Analyzer pod   Analyzer pod
J1                       master 1       expander 1
J2                       master 2       expander 2
J3                       master 3       expander 3
J4                       master 4       expander 4
J5                       master 5       master 1
J6                       master 6       master 2
J7                       expander 1     -
J8                       expander 2     master 3
J9                       expander 3     master 4

The master-card pods are the pods for capturing clock signals.

3. If the more expensive HP16555A logic analyzer cards are selected, two cards
are still required. Once again they should be wired together in accordance
with the HP Installation Manual. Note that even if a pair of cards is
purchased together, they may not be interconnected in accordance with HP’s
manual recommendations. Two HP16555A analyzer cards provide a total of
8 trace pods. Assuming the cards are located in slots A and B, pod B1 (slot B -
expander) should be connected to position J1 on the preprocessor. Pod B2 to
position J2, and so on. Pods A1-A2 should be connected to J5-J6, and pods
A3-A4 to connections J8-J9 (skipping J7); see Table 7-5. The analyzer
configuration file POD_040._A will format the analyzer cards for this
configuration. Note that the configuration file required for HP16555A cards,
although it has the same name, is not the same file required to configure HP16550A
cards. The B-pods are allocated before the A-pods because the card
in the B slot is wired as an expander card and all clock signals must be
acquired by the master card in slot A. The configuration file specifies that
trace signals are captured on MEMCLK signal edges, and MEMCLK is
provided on connector J5.

4. If the 29K target system is operating with 1x clocking, then the master clock
should be configured to acquire trace data on the rising edge of MEMCLK. If
2x Scalable Clocking is being used, the master clock should acquire trace
data on both the rising and falling edges of MEMCLK. The appropriate
selection  can  be  made  from  the  analyzer  control  panel  (remotely  if  desired). 
With  newer  versions  of  the  preprocessor  (those  that  provide  access  to  the 
processor’s  DIV2  pin),  the  MonTIP  software  warns  the  user  that  the  wrong 
edge-selection  has  been  made. 

5. There is no need to install the Corelis preprocessor support software supplied
with the PI-Am29040. However, if it is installed then it will be possible to
present disassembled instructions on the analyzer display. Using the Corelis
software, trace label DATA can be displayed in Invasm (inverse assembler)
format without resulting in any conflicts with MonTIP’s access to DATA.
There is no advantage to using the disassembly software, as the analyzer
display shows instructions which are fetched but not necessarily executed.

6. There are a number of limitations imposed by the Am29040 Traceable Cache
architecture. These were previously discussed in section 7.8 under the
heading Processing Trace Information. Very briefly, the complete instruction
flow is reported: the labels DATA, ADDR, R/_W and I/_D have their values
manipulated to report the instruction which was executed during the traced
cycle.

7.  The  trace  data  processing  algorithms  built  into  MonTIP  need  to  know  the 
endian-ness  of  the  29K  target  processor.  When  connection  to  the  analyzer  is 
established,  a  window  displaying  the  analyzer  control  panel  will  appear. 
MonTIP  prints  a  message  in  this  window  indicating  the  endian-ness  of  the 
target  processor.  If  the  endian-ness  is  unknown,  MonTIP  will  continue 
operating;  but  sub  word-sized  data  accesses  will  only  be  partially  processed. 
To fully process data accesses, the “Analyzer 1: Name” field provided under
the logic analyzer “Configuration” window should be set to AM29040B or
AM29040L, for big- or little-endian operation respectively.

8.  The  MonTIP  algorithms  are  currently  restricted  to  operating  with  systems 
which  fetch  instructions  from  32-bit  memory.  This  does  not  necessitate  that 
32-bit  ROM  emulation  be  used  with  NetROM.  If  application  programs  are 
loaded  and  execute  from  32-bit  memory,  they  can  be  successfully  traced. 
However,  if  interrupt  handlers  or  other  support  code  is  run  from  8-bit 
memory,  tracing  will  not  be  possible. 

9.  can’t  reduce  capture  rules 
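
Step 7 above notes that sub word-sized data accesses can only be fully processed when the endianness is known. The reason can be seen in the following Python sketch, which shows the general byte-lane rule for a 32-bit bus; it is not taken from the MonTIP trace algorithms, and the function name is illustrative.

```python
def extract_byte(bus_word, address, big_endian):
    """Pick the byte addressed by the low two address bits from a 32-bit word."""
    lane = address & 3
    # Big-endian: byte 0 is the most significant byte of the word.
    # Little-endian: byte 0 is the least significant byte of the word.
    shift = (3 - lane) * 8 if big_endian else lane * 8
    return (bus_word >> shift) & 0xFF


word = 0x11223344
print(hex(extract_byte(word, 0, big_endian=True)))
print(hex(extract_byte(word, 0, big_endian=False)))
```

The same captured DATA value yields a different byte for the same address depending on the endianness selected, so a trace processor that does not know the setting cannot report which byte was accessed.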

7.9.5 Corelis PI-Am29460 Preprocessor

A  logic  analyzer  preprocessor  simplifies  the  connection  of  the  analyzer  to  the 
target  system.  The  principles  behind  its  operation  were  discussed  in  section  7.8.  This 
section  briefly  deals  with  the  operating  details  encountered  with  the  Am29460 
preprocessor. For those simply interested in getting their preprocessor working, and
not  at  this  stage  needing  to  understand  the  background  behind  its  operation,  proceed 
to  the  section  with  the  subheading  PI-Am29460  Setup  and  Limitation. 

The  Traceable  Cache  information  provided  by  the  Am29040  slave  processor  is 
synchronous  with  program  execution.  If  this  approach  were  taken  with  the  Am29460 
microcontroller,  the  superscalar  execution  capability  would  necessitate  very  high 
speed  trace  reporting.  To  reduce  the  slave  processor’s  information  bandwidth 
requirements,  the  Am29460  does  not  synchronize  trace  reporting  with  program
execution.  The  Am29460  trace  information  is  compressed,  relative  to  the  Am29040 
trace  data,  and  held  in  an  output  queue  before  being  transferred  off-chip. 

PI-Am29460 Preprocessor Operation

The  main  processor  operation  is  driven  by  the  PCI  INCLK  signal  pin.  However, 
for  data  capturing  purposes,  the  logic  analyzer  master  clock  is  the  slave  trace  clock 
(TRACECLK).  It  runs  at  half  the  internal  clock  speed.  Note  that  the  internal  clock 
(not  available  on  a  pin)  runs  at  2x,  3x  or  4x  the  INCLK  speed.  With  2x  clocking  and 
single-cycle  MCU  transfer  rates,  the  MCU  access  speed  would  equal  the 
TRACECLK  speed.  With  2x,  or  higher  scaling  ratios,  the  PCI  and  MCU  data  transfer 
rates  can  not  exceed  the  frequency  of  TRACECLK.  This  enables  TRACECLK  to  be 
used  as  the  master  clock  by  the  logic  analyzer. 
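
The clock relationships described above reduce to simple arithmetic, sketched here in Python. The 33 MHz INCLK value is only an example, and the function name is illustrative.

```python
def traceclk_hz(inclk_hz, scale):
    """TRACECLK frequency: half the internal clock, which is scale x INCLK."""
    if scale not in (2, 3, 4):
        raise ValueError("internal clock runs at 2x, 3x or 4x INCLK")
    internal_hz = inclk_hz * scale
    return internal_hz / 2


INCLK_HZ = 33_000_000   # example PCI clock rate, not a required value
for scale in (2, 3, 4):
    print(scale, traceclk_hz(INCLK_HZ, scale))
```

With 2x clocking TRACECLK equals INCLK; with 3x and 4x clocking it is 1.5 and 2 times INCLK. In every case TRACECLK is at least as fast as INCLK.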

The analyzer captures signal values when the master clock is active, provided at
least one of the following conditions also holds: valid trace information, a valid PCI
access, or a valid MCU access. PCI accesses are first captured by the analyzer slave
clock; the PCI INCLK is used to capture logic analyzer slave information. Analyzer
slave signals are transferred into the logic analyzer trace buffer on the next master
clock signal. If more than one slave value is captured before the next master clock,
then only the most recent slave values are stored in the analyzer trace buffer. For this
reason it is important that the master clock operate at a higher frequency than the
slave clock.
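The effect of slave values being overwritten can be illustrated with a small model. This is a simplified sketch of the capture behaviour described above, not of the HP analyzer hardware; the event format and the sample names are assumptions made for the example.

```python
def capture(events):
    """Model slave/master clock capture.

    events: list of (time_ns, clock, value) tuples sorted by time, where
    clock is "slave" or "master". Returns the trace buffer contents.
    """
    trace_buffer = []
    latched = None
    for time_ns, clock, value in events:
        if clock == "slave":
            latched = value            # a newer slave sample overwrites an older one
        elif clock == "master" and latched is not None:
            trace_buffer.append((time_ns, latched))
            latched = None
    return trace_buffer

# Master clock faster than slave clock: every PCI sample reaches the buffer.
ok = capture([(0, "slave", "pci0"), (10, "master", None), (20, "master", None),
              (30, "slave", "pci1"), (40, "master", None)])

# Slave clock faster than master clock: pci0 is lost before the master edge.
lost = capture([(0, "slave", "pci0"), (10, "slave", "pci1"), (30, "master", None)])

print(ok)
print(lost)
```

In the second case only the most recent slave value survives, which is why the master clock must be the faster of the two.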

The HP16500B analyzer only supports one slave clock. For this reason, MCU
accesses are latched and held until the next master clock edge, at which point they
are recorded by the analyzer together with any PCI access captured by the slave
clock; see Figure 7-18. It is impossible for two MCU accesses to occur before the
next TRACECLK, even if another agent uses the PCI to access an MCU. In any case,
the trace processing algorithms are only interested in MCU accesses initiated by the
processor, not the PCI. The preprocessor and its supporting software are not intended
to form a general purpose PCI probe.


Figure 7-18. PI-Am29460 Preprocessor Trace Capture Scheme
(Diagram: MCU signals, latched by _STROBE, CAS and RAS, and PCI signals, captured
by the slave clock INCLK (and other conditions), enter the logic analyzer trace
buffer on the master clock, the delayed TRACECLK.)

The HP logic analyzer specification states that there must be 4 ns separating the
active edge of the slave clock and the active edge of the master clock. Master clock to
slave clock separation is specified as 0 ns. The delay is required to ensure that the
slave information is valid before it is entered into the logic analyzer trace buffer at the
active master clock signal. Signals captured directly by the master clock have a 4 ns
set-up time and a 0 ns hold time. Consequently, the active edge of the master clock
must not be allowed to arrive within 4 ns of an active slave clock’s arrival. The
preprocessor achieves this by delaying the TRACECLK signal used to generate the
analyzer master clock. An example timing sequence is presented in Figure 7-19.


Figure 7-19. PI-Am29460 Preprocessor Trace Capture Timing
(Timing diagram showing INCLK, the internal clock, the MCU clock, TRACECLK, an
example MCU access and a PCI access.)

There is an additional reason for delaying the TRACECLK; the slave processor
output signals, including TRACECLK, are actively driven at the same time.
Consequently, slave signals which are to be sampled using the TRACECLK may be
changing at the same time as TRACECLK.

Using MCU latching and PCI slave clocking has a further benefit: better
utilization of analyzer trace depth. When a trace buffer entry is recorded during an
active TRACECLK edge, trace information as well as PCI and MCU information is
captured in a single trace line. This results in more efficient use of the trace buffer
than if each of these three asynchronous events were separately captured by the
logic analyzer.

RLE  Data  Pairs 

As  with  the  Am29040  processor,  the  second  Am29460  slave  processor  is 
entirely  responsible  for  providing  the  data  required  to  reconstruct  the  instruction 
execution  stream.  The  second  processor,  the  slave,  provides  three  types  of  trace  data: 
information  about  MCU  accesses,  information  about  PCI  accesses,  and  instruction 
execution  flow.  Unlike  the  Am29040,  the  slave  does  not  provide  any  information 
about data cache hit activity. Data about instruction flow is provided in the form of
address-length pairs, known as Run Length Encoding (RLE).

The Am29040 slave processor does not need to provide access type
information about data accesses, as they can be fully observed by monitoring the
main processor’s data busses. However, many of the bus signals available with a
2-bus processor are not available with a 29K microcontroller. For example, the I/_D
pin is not available. This means that when a memory read access is performed, it is
not possible to determine if data or an instruction is being fetched. With the Am29460
processor, the slave provides this type of information. This explains why the slave
provides trace data for both MCU and PCI accesses performed by the master
processor.

Before describing the RLE technique in more detail, we must first remind
ourselves of the speculative execution nature of the Am29460 processor. As
explained in section 1.7, instructions are fetched and speculatively executed.
However, instructions are not truly considered to have executed until they have been
retired. This introduces the notion of a Retire Program Counter (R-PC). At any time,
instructions whose addresses are ahead of the R-PC may be held in the reorder buffer
waiting for potential retirement. Special register PC1 contains the address of the
instruction currently in execution. Because the processor supports precise interrupts,
the PC1 register can never get ahead of the R-PC address. When a trap or interrupt is
taken, the R-PC value will appear in register PC1 or PC0 (decode address)
depending on the stage at which the processor pipeline is interrupted.

Each RLE (TRACEADDR, TRACERUN) data pair specifies that
TRACERUN instructions, starting from the current R-PC, have been retired, and
subsequent retirement is to continue from an R-PC value of TRACEADDR. A
TRACERUN value of zero is permitted; it is used to redirect trace flow without
recording any instruction execution (retirement). In such a case a TRACEADDR
value change accompanies the TRACERUN zero value. A value of zero is also used
to indicate that no instructions are available for retiring. In such a case the
TRACEADDR value does not change. An example RLE trace sequence is shown in
Figure 7-20.
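The decoding rule for RLE data pairs can be sketched in a few lines. This is an illustrative reading of the rule stated above, not the algorithm used by MonTIP; the function name, the starting R-PC and the sample addresses are all assumptions. The only hardware fact relied upon is that 29K instructions are 32 bits wide.

```python
INSTRUCTION_BYTES = 4   # 29K instructions are 32 bits wide

def decode_rle(start_rpc, pairs):
    """Expand (TRACEADDR, TRACERUN) pairs into retired instruction addresses.

    start_rpc is the R-PC value in force before the first pair is seen.
    """
    rpc = start_rpc
    retired = []
    for trace_addr, trace_run in pairs:
        # TRACERUN instructions, starting at the current R-PC, have retired.
        for i in range(trace_run):
            retired.append(rpc + i * INSTRUCTION_BYTES)
        # Retirement then continues from TRACEADDR. With TRACERUN zero and an
        # unchanged TRACEADDR this leaves the R-PC where it was.
        rpc = trace_addr
    return retired

# Hypothetical trace: three instructions retire, an idle cycle follows, the
# flow is redirected without retirement, then two more instructions retire.
pairs = [(0x2000, 3), (0x2000, 0), (0x3000, 0), (0x3010, 2)]
addresses = decode_rle(0x1000, pairs)
print([hex(a) for a in addresses])
```

Note that both uses of a zero TRACERUN fall out of the same rule: the pair retires nothing, and the R-PC simply takes the TRACEADDR value, changed or not.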

An  RLE  data  pair  can  be  output  by  the  slave  processor  during  the  rising  edge  of 
each  TRACECLK.  The  RLE  data  is  not  provided  directly  by  the  reorder  buffer,  but 
by  a  buffer  queue  which  receives  its  input  from  the  reorder  buffer,  see  Figure  7-21. 
This  is  necessary  as  very  short  run  lengths  produce  RLE  data  at  a  rate  faster  than  the 
TRACECLK  can  report  them.  Using  a  queue  reduces  the  need  to  stop  instruction 
retirement  until  the  RLE  data  has  been  presented  on  the  slave  output  pins.  However, 
the  queue  is  limited  in  size  and  consequently,  under  rare  circumstances,  it  can 
become  full.  This  results  in  the  master  processor  postponing  instruction  retirement 
until  a  queue  entry  is  available.  Without  this  throttling  back  approach,  uninterrupted 
reporting  of  instruction  flow  could  not  be  guaranteed. 
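The throttling behaviour of the output queue can be modelled as follows. This is a toy model under stated assumptions: the queue capacity is not given in the text, so the values used here are hypothetical, and the model reduces the reorder buffer to a list of how many RLE pairs it wants to produce in each TRACECLK period.

```python
from collections import deque

def simulate_queue(pairs_per_tick, capacity):
    """Return (pairs_reported, stalled_ticks) for a bounded RLE output queue."""
    if capacity < 1:
        raise ValueError("queue needs at least one entry")
    queue = deque()
    work = deque(pairs_per_tick)
    backlog = 0          # pairs waiting because the queue was full
    reported = 0
    stalled_ticks = 0
    while work or backlog or queue:
        if backlog == 0 and work:
            backlog = work.popleft()     # retirement proceeds this period
        elif backlog:
            stalled_ticks += 1           # retirement postponed: queue was full
        accepted = min(backlog, capacity - len(queue))
        queue.extend(range(accepted))
        backlog -= accepted
        if queue:
            queue.popleft()              # one pair leaves per TRACECLK rising edge
            reported += 1
    return reported, stalled_ticks

# Two bursts of very short runs followed by quiet periods.
print(simulate_queue([3, 3, 0, 0], capacity=4))
print(simulate_queue([3, 3, 0, 0], capacity=8))
```

In both runs every pair is eventually reported; the smaller queue achieves this only by stalling retirement for one period, which is the trade-off described above.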
Figure 7-20. Slave Data Supporting Am29460 Traceable Cache

Figure 7-21. RLE Output Queue From Reorder Buffer
(Diagram: reorder buffer retirement feeds the run-length encoding output queue.)


Given that the TRACECLK runs at half the internal speed of the processor, as
many as 8 instructions could be retired at each TRACECLK interval. However, as a
means of reducing RLE queue entries, an entry is not placed in the queue if the current
sequence of instructions being retired does not contain a branch instruction. In such
a case, TRACERUN is allowed to accumulate to a maximum of 31, the largest value
which can be reported on the 5 pins allocated to TRACERUN. Research (see section
1.7.4, [Johnson 1991]) has shown that instruction sequences typically contain five or
six instructions before branching. This would indicate that TRACERUN values near
31 will not be a frequent occurrence, and values in the range five to six are to be
expected.
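The accumulation of TRACERUN can be sketched as an encoder. This is a simplified illustration rather than a description of the on-chip logic: it assumes that a branch shows up as a break in the sequential address flow, and the function name and sample addresses are hypothetical.

```python
MAX_TRACERUN = 31       # largest value reportable on 5 TRACERUN pins
INSTRUCTION_BYTES = 4   # 29K instructions are 32 bits wide

def encode_rle(retired_addresses):
    """Turn a list of retired instruction addresses into (TRACEADDR, TRACERUN) pairs."""
    pairs = []
    run = 0
    for i, addr in enumerate(retired_addresses):
        run += 1
        sequential = addr + INSTRUCTION_BYTES
        is_last = i + 1 == len(retired_addresses)
        next_addr = sequential if is_last else retired_addresses[i + 1]
        # Emit a pair when the flow branches or the run-length counter is full.
        if next_addr != sequential or run == MAX_TRACERUN:
            pairs.append((next_addr, run))
            run = 0
    if run:
        # Flush the instructions still accumulating at the end of the trace.
        pairs.append((retired_addresses[-1] + INSTRUCTION_BYTES, run))
    return pairs

# 40 sequential instructions, a branch to 0x2000, then 5 more instructions.
stream = [0x1000 + 4 * i for i in range(40)] + [0x2000 + 4 * i for i in range(5)]
encoded = encode_rle(stream)
print([(hex(addr), run) for addr, run in encoded])
```

The long sequential run is split at 31 instructions, while the branch after nine further instructions produces a pair of its own.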

PI-Am29460 Setup and Limitations

As with the PI-Am29040 preprocessor, a number of restrictions and
preparation steps apply when using the PI-Am29460 preprocessor:

1.  The  trace  processing  algorithm  places  the  value  X_SLAVE_  in  the  DATA 
column  for  all  instruction  accesses.  It  is  necessary  to  have  access  to  the  COFF 
file(s) for the loaded program to ensure the X_SLAVE_ value is replaced with
the  actual  29K  instruction  executed.  Debuggers  such  as  UDB  can  retain 
multiple  COFF  file  images  at  the  same  time.  This  enables  simultaneous 
tracing  of  application  space  and  operating  system  space  (say,  interrupt 
handlers). 

2. For memory accesses performed by the MCU, only data accesses are
reported. Data transfer is shown at the time it appears on the system busses,
which, for data stores, may be several cycles after the corresponding STORE
instruction.

Two  bus  29K  processors  have  OPT  pins  and  lower  (A1-A0)  address  pins 
which  indicate  the  size  and  alignment  of  the  object  currently  being  accessed. 
The  Am29460  microcontroller  does  not  have  these  pins.  Consequently  it  is 
not  possible  to  determine  the  alignment  and  size  for  reads  of  sub  word-sized 
objects.  Fortunately,  the  microcontroller  has  four  byte  enable  pins  which  are 
used  for  data  writes.  This  enables  the  alignment  and  size  of  objects  which  are 
written  to  be  determined,  and  improves  the  trace  reporting  for  data  writes. 

Only  MCU  accesses  performed  on  behalf  of  the  Am29460  processor  are 
reported.  Accesses  initiated  by  another  processor  via  the  PCI  interface  will 
not  appear  in  the  trace. 

3.  Considering  accesses  to  the  PCI  bus,  as  with  MCU  accesses,  only  data 
transfers  are  reported.  By  monitoring  the  PCI  command  provided  during  the 
address-phase  of  a  PCI  access,  it  is  possible  to  determine  the  object  size  for 
sub  word-sized  objects. 

4.  The  PI-Am29460  preprocessor  does  not  reconstruct  32-bit  MCU  addresses. 
This  can  complicate  logic  analyzer  triggering.  One  solution  is  to  use  the 
on-chip  breakpoint  control  registers  to  generate  a  _SYNC  pulse  which  is 
then  used  to  trigger  the  analyzer.  The  UDB  debugger  has  a  convenient  user 
interface  for  specifying  breakpoint  control  register  operation.  Unfortunately, 
however,  breakpoint  control  registers  are  a  limited  resource,  and  they  are  also 
used  to  control  program  execution. 

5. Unlike those of a scalar processor, processed trace lines for the Am29460
indicate multiple instruction execution per trace line. The number of
instructions reported executed by a trace line has less to do with the
instruction retirement rate and more to do with the run-length between branch
instructions. When an MCU or PCI access occurs at the same time as RLE
reporting, the processed trace indicates all activity on the same processed
trace line.

6.  The  trace  data  processing  algorithms  built  into  MonTIP  need  to  know  the 
endian-ness  of  the  29K  target  processor.  When  connection  to  the  analyzer  is 
established,  a  window  displaying  the  analyzer  control  panel  will  appear. 
MonTIP  prints  a  message  in  this  window  indicating  the  endian-ness  of  the 
target  processor.  If  the  endian-ness  is  unknown,  MonTIP  will  continue 
operating;  but  sub  word-sized  data  accesses  will  only  be  partially  processed. 
To fully process data accesses the “Analyzer 1:Name” field provided under
the  logic  analyzer  “Configuration”  window  should  be  set  to  AM29460B  or 
AM29460L,  respectively  for  big  or  little  endian  operation. 

7.  If  HP16550A  logic  analyzer  cards  are  being  used  with  the  HP16500B  system, 
then  two  cards  should  be  wired  together  in  accordance  with  the  HP 
Installation  Manual.  Two  analyzer  cards  provide  a  total  of  12  trace  pods.  If 
the  more  expensive  HP16555A  logic  analyzer  cards  are  selected,  three  cards 
are  required.  Once  again  they  should  be  wired  together  in  accordance  with  the 
HP  Installation  Manual.  Note  that  even  if  cards  are  purchased  together,  they 
may not be interconnected in accordance with HP’s manual
recommendations.

Table  7-6  shows  the  assignment  of  analyzer  pods  to  preprocessor  connectors. 
The  analyzer  configuration  file  POD_460._A  will  format  the  analyzer  cards 
for this configuration. (The ._A file name postfix is used because the master
analyzer card is located in card cage slot A.) The POD_460 configuration file
is  available  from  AMD  or  Corelis.  It  is  important  to  obtain  a  copy  of  the 
configuration  file,  as  it  is  much  too  time  consuming  to  reassign  the 
pod-to-label  mapping  by  hand. 

Note that the configuration file required for HP16555A cards, although it has
the same name, is not the same file required to configure HP16550A cards.

8.  A  logic  analyzer,  controlled  by  UDB,  may  be  attached  to  a  29K  target  system 
which  is  not  under  UDB  control.  This  is  the  case  where  a  logic  analyzer  is 
used  alone,  without  the  utilization  of,  say,  a  NetROM.  It  is  also  the  case  when 
previously  captured  trace  data  is  reexamined.  To  enable  32-bit  address 
reconstruction,  the  algorithms  built  into  MonTIP  need  to  know  the 
processor’s  bank  profile  register  (BPR)  settings.  MonTIP  normally 
accomplishes  this  by  accessing  the  DebugCore  each  time  trace  data  is 
fetched.  When  no  DebugCore  is  present,  MonTIP  is  provided  the  BPR  values 
from  the  udb.rc  initialization  file. 
Table 7-6. PI-Am29460 Logic Analyzer Pod Assignment

PI-Am29460               HP16550A        HP16555A
Preprocessor connector   Analyzer pod    Analyzer pod
----------------------   ------------    -----------------
J1                       master 1        master 1
J2                       master 2        master 2
J3                       master 3        master 3
J4                       master 4        master 4
J5                       master 5        first expander 1
J6                       master 6        first expander 2
J7                       expander 1      first expander 3
J8                       expander 2      first expander 4
J9                       expander 3      second expander 1
J10                      expander 4      second expander 2
J11                      expander 5      second expander 3
J12                      expander 6      second expander 4

Note: the master card pods are the pods for capturing clock signals.

The BPR registers are mapped into the processor’s I/O address space. The
UDB “outl <address> <value>” command can be used to write a 32-bit value
to an I/O location. Note that the command can also be used to write to a
memory location, but this requires that an “<ESC> ioctl space d” command
first be used to switch output to memory space (“d”) rather than the default
I/O space (“i”). After connection to the analyzer has been established, “outl”
commands can be used to set BPR values for use by MonTIP. This is best done
by binding a user defined button to the Trace frame. When MonTIP has no
connection to a 29K target, it does not try to set the real BPR registers, but
retains the values for future use. The following udb.rc command sequence
defines a BPR button for an example register initialization. Before analyzer
data is fetched, the BPR button should first be clicked. This enables the
algorithms to correctly build address values.


#ioctl  space  i 

macro m=mtrace -f -"BPR" { lb 3 } "{.com}outl ffffff80 00001003H;
outl ffffffa0 800c6005H; outl ffffffb0 900c6105H; outl ffffffc0
a00c6205H; outl ffffffd0 b00c6305H; \r"


9. The TRACEADDR addresses provided by the slave processor are virtual,
assuming address translation is in use. However, the address values observed
for MCU and PCI accesses are always physical. This creates difficulty when
looking up the MCU and PCI addresses in the loaded COFF file. A program
must run with physical addressing, or with one-to-one virtual to physical
address translation, if MCU and PCI address symbols are to be correctly
reported.
10. The PI-Am29460 preprocessor contains additional analyzer connections
(J13-J16). These are for use by hardware development engineers. They are
not required for program tracing. They are provided to enable capturing of
unlatched processor signals. A number of the connectors used for software
tracing latch their signal values, and this disrupts the analyzer’s visibility of
timing relationships. A hardware engineer can use the alternative connectors
to view unlatched versions of the main processor signal pins.
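Item 2 of the list above notes that the four byte enable pins allow the size and alignment of written objects to be determined. A minimal sketch of that decoding is given below. It rests on assumptions not stated in the text: the enables are treated as a 4-bit mask with a set bit meaning "lane active", and the mapping from lane number to byte address (which depends on endian-ness) is left to the caller.

```python
def decode_byte_enables(mask):
    """Return (first_lane, size_in_bytes) for a 4-bit byte enable mask.

    Assumes bit i of mask is set when byte lane i takes part in the write.
    """
    if not 0 < mask <= 0xF:
        raise ValueError("expected a non-zero 4-bit mask")
    lanes = [i for i in range(4) if mask & (1 << i)]
    first, size = lanes[0], len(lanes)
    contiguous = lanes == list(range(first, first + size))
    if not contiguous or size == 3 or (size == 2 and first % 2):
        raise ValueError("not an aligned byte, half-word or word access")
    return first, size

print(decode_byte_enables(0b1111))   # word write
print(decode_byte_enables(0b1100))   # half-word write in the upper lanes
print(decode_byte_enables(0b0100))   # single byte write
```

No equivalent decoding is possible for MCU reads, which is why the text singles out writes as having the better trace reporting.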
Chapter 8


Selecting  a  Processor 


This  chapter  helps  with  the  sometimes  difficult  task  of  processor  selection. 
Processors  are  considered  in  terms  of  their  performance  and  software  programming 
requirements.  There  is  little  attempt  to  review,  say,  development  tools  or  bus  timing 
for  alternative  processors.  Consequently  the  chapter  is  of  most  interest  to  software 
engineers  and  computer  scientists.  In  undertaking  comparative  processor  evaluation, 
the  often  confusing  task  of  performance  benchmarking  is  studied  for  dissimilar 
processors.  This  chapter  will  enable  you  to  better  understand  the  methodologies  used 
by  manufacturers  trying  to  win  the  benchmark  race,  and  presents  an  approach  which 
will  enable  you  to  more  accurately  determine  a  processor’s  performance  for  your 
own  application. 

Processor  execution  speeds  are  restricted  to  integer  performance  evaluation.  No 
attention  is  given  to  floating-point  performance.  This  should  not  be  disappointing,  as 
the  selection  of  a  processor  is  greatly  limited  if  floating-point  performance  is  critical. 
Most  manufacturers  have  processors  (such  as  the  Am29050)  which  are  specifically 
intended  for  floating-point  use.  Additionally,  the  large  majority  of  systems  are  not 
concerned  with  floating-point  operations. 

The  well  known  Stanford  benchmark  developed  at  Stanford  University  is  used 
for  performance  comparisons.  It  is  easy  to  criticize  the  choice  of  a  synthetic 
benchmark. However, it is difficult to come up with a more acceptable
alternative which everyone will agree upon. At least the Stanford benchmark is more
revealing than the overused (and often unreliable) Dhrystone benchmark. Separate
results  for  six  of  the  integer  routines  taken  from  the  integer-part  of  the  Stanford  code 
will  be  shown.  The  six  were  chosen  because  of  their  diversity  in  function  and 
similarity  in  execution  times.  This  similarity  made  for  clearer  scaling  and  hence 
easier  comparison  of  the  results. 
The Stanford benchmark is relatively small and can have high instruction cache
hit ratios. It also does not exhibit the large data movement activities typical of
communications applications. For this reason a LAPD benchmark, which is larger
and more representative of communication applications, is also used. The LAPD
acronym stands for Link Access Protocol-D. It is an ISDN protocol used by the
communications industry when sending packet information between a caller and
callee. The benchmark is intended to measure a processor’s ability to crunch a
typical layered protocol stack. The LAPD code used is based on a prior AMD
software product, “AmLink”. The benchmark is in three parts: send an information
package and receive an un-numbered acknowledge; receive an information
package and respond with an un-numbered acknowledge; and send an information
package and receive an information package. Results are presented in terms of
geometric mean values for packet switching speeds for the three parts (the geometric
mean is found by multiplying the three results and taking the cube-root of the
product).
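The geometric mean calculation described above is easily reproduced. In the sketch below the three packet switching speeds are made-up values chosen only to illustrate the arithmetic; they are not results from the LAPD benchmark.

```python
import math

def geometric_mean(values):
    """Multiply the results together and take the n-th root of the product."""
    if not values:
        raise ValueError("at least one value is required")
    return math.prod(values) ** (1.0 / len(values))

# Hypothetical packets-per-second results for the three benchmark parts.
parts = [8000.0, 10000.0, 12500.0]
lapd_score = geometric_mean(parts)
print(round(lapd_score, 3))
```

Unlike the arithmetic mean, the geometric mean is not dominated by the fastest of the three parts, which makes it a reasonable single figure for comparing dissimilar processors.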

The  performance  results  presented  can  act  as  a  guideline  for  your  own 
application.  However,  the  only  certain  way  to  know  a  processor’s  performance  for 
any  particular  processor/memory  configuration  is  to  benchmark  your  own  code  on 
the  system  or  an  Architectural  Simulator. 


8.1  THE  29K  FAMILY 

Chapter  1  described  the  features  of  the  29K  family  members  in  detail.  The 
family  is  divided  into  three  main  groupings:  three-bus  microprocessors,  two-bus 
microprocessors,  and  microcontrollers.  This  section  will  concentrate  on  the 
sometimes  difficult  task  of  selecting  a  particular  family  member.  When  designing  a 
new  microprocessor  system,  price  and  performance  expectations  restrict  the  choice 
of  available  processors.  It  is  not  acceptable  to  select  a  high-end  processor  with  a  fast 
memory  system  when  the  budget  requires  a  low  system  cost.  It  is  equally  important  to 
be  aware  that  a  low-end  processor  with  inexpensive  memory  system  may  not  have 
the  required  performance.  There  can  also  be  other  design  restrictions,  such  as  low 
power  consumption  or  short  development  time,  that  further  influence  the  processor 
selection.  The  problem  of  selecting  a  processor  is  dominated  by  the  difficulty  of 
evaluating  relative  performance  of  different  processor  and  memory  combinations. 
To  help  resolve  this  problem,  I  have  simulated  a  wide  range  of  potential  systems  and 
determined  their  relative  performance.  The  results  are  presented  in  the  following 
sections. 

The  review  is  divided  into  two  sections:  first,  microcontrollers;  and  second,  all 
types  of  microprocessors.  The  division  is  natural.  One  of  the  first  decisions  to  be 
made  is  whether  to  use  a  microcontroller  or  a  microprocessor. 
They each have advantages and disadvantages, summarized below:

Performance  The  29K  microcontrollers  make  available  a  wide  range  of  system 
performance.  However,  they  do  not  enable  construction  of  the 
fastest  systems.  The  29K  2-bus  microprocessors  have  the 
advantage  of  higher  processor  clock  speeds  and  larger  on-chip 
cache.  They  can  also  operate  with  faster  memory  systems, 
although  the  construction  of  these  fast  memory  systems  is  not  as 
simple  as  attaching  a  memory  system  to  a  microcontroller. 

Design  Time  The  hardware  design  time  is  less  with  a  microcontroller.  This  is 
mainly  because  the  microcontroller  contains  memory  interface 
controllers  on-chip.  There  is  no  need  to  build  any  DRAM  refresh 
circuitry  or  memory  interface  logic.  A  number  of  frequently 
required peripheral devices, such as UARTs and Input/Output
(I/O) ports, are also incorporated into the microcontrollers.
Eliminating the need to select, integrate, and debug these
peripherals when they are required by the system is an advantage.

System  Cost  Microcontroller  systems  generally  cost  less  to  design  and 
construct;  they  offer  good  value.  The  higher  price  of  2-bus 
microprocessors  is  justifiable  if  higher  system  performance  is 
required.  Additionally,  the  higher  cost  of  the  associated  high 
performance  memory  system  makes  the  higher  microprocessor 
cost  more  acceptable. 

Future Expansion  Frequently systems are required to be built at different
price/performance combinations. Both types of processor have
something to offer in this area. The 2-bus processors are all pin
and bus compatible. It is possible to initially design with an
Am29035 processor using a 2/1 DRAM memory at 16 MHz. The
processor can be directly replaced with an Am29030 or Am29040,
each offering additional performance. Additionally, the faster
processors could be used at 33 MHz using Scalable Clocking to
achieve the highest performance system. Each processor has a
different cost. But, without redesigning the system, a simple
performance upgrade (or alternatively downgrade) path is
available.

The upgrade path is not as simple with microcontrollers. However,
it is possible (and frequently done) to build a system with a
multiple microcontroller footprint. The Am29240 device is
bigger than the Am29200, which in turn is bigger than the
Am29205. The difference in the physical size of the devices (the
footprints) enables a board layout with a concentric pad site for all
three microcontrollers. Hence, the same board can be utilized with
different processors. However, because of the different access
timing of DRAMs used by microcontrollers, it would be necessary
to also upgrade the memory devices. This does not generally
present much of a problem, as a single board layout can easily
accommodate different memory device speeds.

Peripherals  Microcontrollers  have  the  advantage  of  on-chip  peripherals.  As 
well  as  simplifying  the  design  process,  they  enable  a  smaller  board 
layout area and reduced system power consumption. The close
coupling of the on-chip peripherals to the processor enables fast
communication between the two, even at high clock speeds. There
is also no extra cost for the peripherals.

Memory  Choice  When  DRAM  is  used,  microprocessors  enable  a  wider  range  of 
memory  systems  to  be  constructed.  The  Am2920x 
microcontrollers  only  support  3/2  DRAM  access  (3-cycle  first 
access,  thereafter  2-cycle  burst).  The  higher  performance 
Am2924x  microcontrollers  only  support  2/1  DRAM  access. 
Burst-mode  can  be  disabled  resulting  in  slower  3/3  and  2/2 
memory  access  respectively,  but  there  is  little  else  that  can  be 
modified.  However,  for  many  systems,  the  restrictions  inherent  in 
the  built-in  microcontroller  DRAM  interface  will  present  no 
problem. 

Board  Size  Microcontrollers  are  much  more  likely  to  enable  a  smaller  board 
layout  area.  They  have  less  need  for  peripheral  support  circuitry, 
particularly  because  of  their  built-in  memory  interfaces. 

Power  Consumption  The  Am2924x  microcontrollers  can  operate  at  3.3  volts  and 
support  power  saving  operating  modes.  Additionally, 
microcontroller  based  systems  have  less  need  for  power 
consuming  peripherals.  This  gives  the  microcontrollers  the 
advantage  when  constructing  a  low  power  system.  However,  the 
Am2920x  low  cost  microcontrollers  do  not  have  the  power  saving 
features.  The  only  microprocessor  particularly  suited  to  low 
power  systems  is  the  Am29040.  Because  of  its  higher  cost  and 
higher clock rate, its use is restricted to higher performance systems.

Tool  Selection  User  mode  code  is  compatible  across  the  entire  29K  family.  This 
means,  for  instance,  a  C  compiler  can  produce  code  for  any  of  the 
processors.  However,  there  are  optimizations,  such  as  the  use  of 
integer  multiply  with  the  Am29240  and  Am29040  processors, 
which  can  improve  a  particular  processor’s  performance.  Tool 
selection, with the exception of certain debuggers, is likely to be
equally available, priced, and effective across the 29K family.

Multiprocessor  The  bus  snooping  capability  of  the  Am29040  makes  this 
microprocessor  the  clear  choice  for  a  complex  multiprocessor 
design.  For  less  complex  systems,  where  a  29K  may  be  used  as  a 
coprocessor  for  a  peripheral  task,  the  microcontrollers  have  an 
on-chip  parallel  port  which  can  be  used  to  communicate  with  the 
main  processor.  When  a  processor  is  used  in  conjunction  with  an 
off-chip  DMA  controller,  or  other  agent  accessing  shared 
memory,  it  is  important  that  a  data  cache  (if  used)  support  a 
write-through  or  copy-back  (with  snooping)  policy.  However,  a 
write-through  data  cache  still  has  problems  with  data  consistency 
when  another  agent  wishes  to  write  shared  memory  (see  section 
5.14). The techniques supported by the 29K family members are
superior, in terms of data consistency, to simply using on-chip SRAM.

Debug  Support  Processors  are  equally  supported  with  software  simulators  and 
low  cost  debug  tools  such  as  ROM  emulators.  The  effectiveness  of 
low  cost  software  debug  tools,  such  as  ROM  emulators  and  debug 
monitors,  is  enhanced  with  on-chip  debug  features  such  as 
Monitor mode and breakpoint control registers. It is mainly the
higher  performance  processors  which  have  these  on-chip  debug 
features.  The  most  popular  processors  are  supported  with  In 
Circuit  Emulators  (ICE)  supplied  by  AMD  partners.  There  are 
also  hardware  and  software  personality  modules  which  enable 
logic  analyzers  to  be  used  across  the  family  for  hardware  and 
low-level  software  debugging. 
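The DRAM access figures quoted under Memory Choice above can be turned into cycle counts for a burst transfer. The sketch below assumes a simple model in which the first access costs the first figure and every following access in the burst costs the second; the four-word burst length is an arbitrary illustration, and effects such as page boundaries and refresh are ignored.

```python
def burst_cycles(first, subsequent, words):
    """Cycles needed to transfer `words` words with a first/subsequent DRAM timing."""
    if words < 1:
        raise ValueError("a transfer moves at least one word")
    return first + subsequent * (words - 1)

# Timings taken from the Memory Choice entry; labels are for display only.
timings = {
    "Am2920x 3/2":             (3, 2),
    "Am2920x 3/3 (burst off)": (3, 3),
    "Am2924x 2/1":             (2, 1),
    "Am2924x 2/2 (burst off)": (2, 2),
}
WORDS = 4   # hypothetical burst length
for label, (first, subsequent) in timings.items():
    print(f"{label}: {burst_cycles(first, subsequent, WORDS)} cycles")
```

Under this model the 2/1 memory moves four words in little more than half the cycles needed by the 3/2 memory, before any difference in clock speed is taken into account.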

The  simulation  results  presented  in  the  following  sections  were  obtained  using 
the  Architectural  Simulator.  This  simulator  accurately  models  processor  operation, 
and  can  be  used  to  evaluate  any  potential  29K  system.  Use  of  the  simulator  was 
described in detail in section 1.14. An event file is required to describe the system’s
characteristics.  For  example,  the  file  below,  200_3232_2232.evt,  was  used  to 
describe  an  Am29200  microcontroller  which  had  a  32-bit  ROM  and  DRAM 
memory  system  (the  3232  part  of  the  file  name),  with  2/2  ROM  access  and  3/2 
DRAM  access  (the  2232  part  of  the  file  name). 


; Architectural Simulator event file, 200_3232_2232.evt
romread    2       ; ROM space, 2/2 access
romwrite   2
romburst   false   ; burst off
rombread   2
romwidth   32      ; 32-bit ROM-space

ramread    3       ; DRAM space, 3/2 access
ramwrite   3
rampage    true    ; pagemode on
rampread   2
rampwrite  2
ramwidth   32      ; 32-bit DRAM space



By  building  new  event  files,  it  is  possible  to  re-run  simulations  and  evaluate  the 
effect  on  the  system’s  performance.  The  simulator  was  run  using  the  command 
below: 


    sim29 -29200 -e 200_3232_2232.evt a.out

The  program  being  simulated,  shown  as  a.out  above,  was  the  LAPD 
benchmark.  I  chose  to  use  LAPD  rather  than  Stanford  because  of  the  high  instruction 
cache  hit  ratio  of  the  Stanford  benchmark — above  90%  with  even  very  small  caches. 
I  believe  modeling  the  performance  of  LAPD  is  more  likely  to  reflect  the  actual 
performance  most  users  will  experience  with  their  own  application  code.  However, 
as  always,  I  urge  you  to  use  your  own  code  when  benchmarking  various  processors. 
The  LAPD  benchmark  is  good  at  testing  data  movement  and  bit  field  (packet  header) 
operations,  but  this  may  not  be  your  requirement.  Additionally,  the  Metaware 
compiler was used with a high level of optimization (-O7) when compiling the
benchmark.  This  produces  the  best  performance  but  may  require  additional  memory 
to  hold  the  expanded  code  which  results  from  such  optimizations  as  loop  unrolling. 
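
Event-file variants for such re-runs can be produced mechanically. The Python sketch below is only an illustration and is not part of the simulator tool set. It assumes the plain keyword-value layout of the 200_3232_2232.evt listing and reuses that listing's keywords and values; the helper names (with_overrides, render_event_file) and the 16-bit variant are hypothetical.

```python
# Illustrative helper, not an AMD tool: builds event-file text from
# keyword/value pairs, assuming the "keyword value" layout of the
# 200_3232_2232.evt listing.

# Keywords and values copied from the 200_3232_2232.evt listing.
BASE_200_3232_2232 = [
    ("romread", 2),
    ("romwrite", 2),
    ("romburst", False),
    ("rombread", 2),
    ("romwidth", 32),
    ("ramread", 3),
    ("ramwrite", 3),
    ("rampage", True),
    ("rampread", 2),
    ("rampwrite", 2),
    ("ramwidth", 32),
]


def with_overrides(settings, **overrides):
    """Return a copy of settings with selected keywords replaced."""
    known = {keyword for keyword, _ in settings}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown event-file keywords: {sorted(unknown)}")
    return [(keyword, overrides.get(keyword, value)) for keyword, value in settings]


def render_event_file(title, settings):
    """Render settings as event-file text, one keyword per line."""
    lines = ["; " + title]
    for keyword, value in settings:
        # Booleans appear as lower-case true/false in the listing.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f"{keyword:<10}{text}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical 16-bit variant of the same system.
    narrow = with_overrides(BASE_200_3232_2232, romwidth=16, ramwidth=16)
    print(render_event_file("16-bit ROM and DRAM variant (hypothetical)", narrow), end="")
```

The resulting text could be saved under a new file name and passed to sim29 with the -e option, as in the command shown earlier.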

8.1.1  Selecting  a  Microcontroller 

Microcontrollers  are  studied  and  grouped  in  this  section  according  to  their 
memory  system  speed.  Initially,  systems  based  on  16  MHz  memory  are  analyzed. 
The  performance  of  both  16-  and  32-bit  wide  memories  is  presented.  However,  no 
8-bit  systems  are  included.  Very  small  systems  based  on  8-bit  memories  and  using 
the  Am2920x  microcontrollers  are  evaluated  in  a  separate  section  (section  8.1.2, 
Moving  up  to  an  Am2920x  Microcontroller).  Memory  systems  operating  at  12.5 
MHz  are  also  dealt  with  in  the  section  dealing  with  very  small  systems. 

16  MHz  Memory  Systems 

Setting  12.5  MHz  systems  aside,  16  MHz  is  the  entry  level  system  speed.  This 
can  be  achieved  using  a  16  MHz  Am29200,  Am29205  or  Am29245  processor,  or  an 
Am29240  using  Scalable  Clocking.  When  Scalable  Clocking  technology  is  used,  a 
33 MHz processor would be combined with a 16 MHz memory system. Both
instruction  and  data  accesses  are  satisfied  by  the  slower  16  MHz  memory.  The 
simulation  results  for  various  systems  running  the  LAPD  benchmark  are  shown  in 
Figure  8-1.  Memory  access  times  for  the  evaluated  systems  are  shown  in  the 
notation:  (initial/subsequent),  for  example  2/1. 

Programmable  Data  width  was  used  to  model  16-bit  and  32-bit  memories.  As 
expected  and  supported  by  the  results,  the  16-bit  memory  systems  offer  less 
performance.  Not  all  of  the  modeled  systems  are  likely  candidates  for  construction. 
They  are  shown  merely  to  report  their  relative  performance.  Some  of  the  most 
interesting  systems  are  highlighted.  For  example,  the  second  from  the  top  entry 
shows  an  Am29205  system  with  16-bit  2/2  ROM  and  3/2  DRAM.  This  is  an  entry 
level  system.  The  first  entry  shows  an  Am29205  operating  from  16-bit  DRAM 
alone.  The  notation  “*/*”  in  the  ROM/SRAM  column  indicates  that  no  ROM 
memory  was  used.  Such  a  system  would  require  initialization  of  the  DRAM  memory. 
This  could  be  achieved  with  an  8-bit  ROM  which  transferred  its  contents  to  DRAM 
before  application  code  execution  commenced.  Note,  it  is  not  possible  to  build  a 
DRAM-only  system  where  a  dual-ported  DRAM  is  initialized  by  another  processor. 
This  is  because  after  reset,  program  execution  commences  from  ROM  region  0.  This 
does  mean  an  SRAM-only  system  could  be  constructed;  assuming  that  the  SRAM  is 
located  in  ROM  region  0,  and  is  somehow  initialized  before  processor  reset. 

The second entry, the 2/2-3/2 system, was linked such that instructions were
fetched from the 2/2 ROM space; all data was accessed from the 3/2 DRAM. The
combined ROM-DRAM system is faster than the 3/2 DRAM-only system. The
DRAM-only system has 81% of the faster system's performance. The difference is
due to instruction accesses being directed to the faster 2/2 memory in the combined
system, and to the frequent occurrence of DRAM precharge cycles in the DRAM-only
system. The Am29200 DRAM is frequently referred to as 3/2; this assumes the
1-cycle of RAS-precharge is hidden. When DRAM-only systems are used, the
precharge is not likely to be hidden, and the access is truly 4/2 rather than 3/2.
This is explained in section 1.14.1 under the Am29200 and Am29205 subheading.
Given that even inexpensive EPROM devices can be 1.5 to 2 times the cost of DRAM
(per byte), it is less expensive to use a single 8-bit EPROM to initialize the DRAM,
and then execute the program from DRAM. However, there is a loss of performance
with this technique.

The  sixth  entry  shows  an  Am29205  system  with  1/1  ROM  and  3/2  DRAM.  The 
system  has  substantially  increased  performance  over  the  2/2  ROM  system  (66% 
faster).  The  notation  1/1  is  used  here  to  indicate  instruction  read  access  times  only. 
The  microcontroller  family  requires  one  wait  state  when  writing  to  ROM  space.  This 
results  in  a  minimum  write  access  time  of  2-cycles  for  ROM  space.  Although  this  is 
important  to  note,  it  has  no  impact  here  as  the  system  performs  all  data  writes  to 
DRAM.  However,  the  system  is  unbuildable  due  to  the  unavailability  of  ROM 
devices  which  can  deal  with  the  very  fast  access  times. 

The access times for ROM space are determined by three parameters. First, the
period of the memory system clock (CP); all memory accesses are synchronized to
the system clock. Second, the delay before synchronous outputs become valid (OV).
Third, the input setup time (IS) for synchronous input signals. When performing
single-cycle memory access, the access time is determined from the ROMOE signal
becoming valid after the falling edge of MEMCLK (OVF). When wait states (WS)
are used, the access time is determined from the address outputs becoming valid after
the rising edge of MEMCLK (OVR). The equations below can be used to calculate
the required minimum memory access times.

    Memory Access Time = (Clock Period)/2 - (Output Valid) - (Input Setup)
                       = (CP/2) - OVF - IS,             WS = 0

    Memory Access Time = (Clock Period) * (1 + Wait States) - (Output Valid) - (Input Setup)
                       = (CP * (1 + WS)) - OVR - IS,    WS > 0

Table 8-1. Memory Access Times for Am2920x Microcontroller ROM Space

    Memory Bus    Clock Period   Output Valid     Input Setup   Memory Access Times (ns)
    Speed (MHz)   (CP ns)        (OVR, OVF ns)    (IS ns)       0-Wait   1-Wait   2-Wait

    12.5          80             15, 15           12            13       133      213
    16            62.5           11, 10           10            11.25    105      167.5
    20            50             11, 10           10            5        79       129

Shown  on  Table  8-1  are  the  required  memory  access  times  for  Am2920x  ROM 
space  memory.  The  1/1  access  times  are  given  under  the  zero  wait  state  column.  At 
16  MHz,  an  11.25  ns  access  time  must  be  supported.  ROM  devices  at  this  speed  are 
not  available.  However,  the  access  times  for  2/2  ROM  (1-wait)  are  reasonable,  and 
can  be  achieved  with  readily  available  90  ns  ROM  devices. 
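
The two access-time equations can be checked against Table 8-1 with a few lines of Python. The function below is a sketch that simply transcribes them; the function and parameter names are chosen here for illustration, and the timing values are those printed in Table 8-1.

```python
def rom_access_time_ns(cp_ns, ov_rise_ns, ov_fall_ns, is_ns, wait_states):
    """Required ROM-space memory access time in ns."""
    if wait_states == 0:
        # Single-cycle access: timed from the falling edge of MEMCLK (OVF).
        return cp_ns / 2 - ov_fall_ns - is_ns
    # With wait states: timed from the rising edge of MEMCLK (OVR).
    return cp_ns * (1 + wait_states) - ov_rise_ns - is_ns


if __name__ == "__main__":
    # (bus MHz, CP, OVR, OVF, IS) rows taken from Table 8-1.
    rows = [(12.5, 80, 15, 15, 12), (20, 50, 11, 10, 10)]
    for mhz, cp, ovr, ovf, setup in rows:
        times = [rom_access_time_ns(cp, ovr, ovf, setup, ws) for ws in (0, 1, 2)]
        print(mhz, "MHz:", times)
```

For the 12.5 MHz and 20 MHz rows this reproduces the tabulated 13, 133, 213 ns and 5, 79, 129 ns. For the 16 MHz row the zero-wait result matches (11.25 ns), while the equations give 104 ns and 166.5 ns for one and two wait states, 1 ns below the printed 105 ns and 167.5 ns.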

It  is  not  until  a  20  MHz  memory  system  is  required  that  particularly  fast  ROM 
need  be  used.  At  this  stage  an  interleaved  ROM  system  could  be  built  or  faster  ROM 
purchased  for  a  higher  cost.  Alternatively,  FLASH  memory  could  be  used.  FLASH  is 
generally  available  with  faster  access  times  than  EPROM.  Table  8-2  lists  a  number  of 
current  AMD  memory  devices  and  their  access  times.  Faster  and  larger  devices  are 
always being developed by AMD and other manufacturers. It is likely that before long
new  memory  devices  will  become  available  and  enable  faster  systems  to  be 
constructed  at  lower  cost. 

The  ninth  entry  shows  a  32-bit  Am29200  based  system  using  2/2  ROM  and  3/2 
DRAM. This may be a popular system for construction. The 32-bit DRAM-only
system has only 71% of the performance of the combined ROM-DRAM system. Thus,
the addition of 32-bit wide ROM will be justifiable for those systems requiring extra
performance.

Faster  DRAM-based  systems  can  be  constructed  using  the  2/1  DRAM 
controller  incorporated  into  the  more  expensive  Am2924x  microcontrollers.  The 
Am29245  is  the  least  expensive,  and  it  is  shown  with  a  32-bit  DRAM-only  system 
(*/*  2/1)  in  entry  thirteen  of  the  table.  The  previous  Am29200  system  based  on 
ROM-DRAM,  has  only  71%  of  the  performance  of  the  DRAM-only  Am29245 
system.  Interestingly,  the  Am29240  system  using  16-bit  DRAM  and  Scalable 
Clocking  is  shown  to  be  faster  than  the  Am29245  using  32-bit  DRAM.  This  is  due  to 
the  higher  internal  clock  rate  and  data  cache  of  the  Am29240. 

Table 8-2. ROM and FLASH Memory Device Access Times

    AMD Device   Speeds (ns)     Capacity    Memory Type

    Am27C010     90, 120, 150    128k x 8    EPROM
    Am27C020     80, 120, 150    256k x 8    EPROM
    Am27C040     90, 120, 150    512k x 8    EPROM
    Am27C080     90, 120, 150    1M x 8      EPROM
    Am28F010     90, 120, 150    128k x 8    FLASH
    Am28F020     90, 120, 150    256k x 8    FLASH
    Am29F010     45, 55, 70      128k x 8    FLASH
    Am29F040     70, 90, 120     512k x 8    FLASH
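
The device speeds in Table 8-2 can be compared with the required access times of Table 8-1. The Python sketch below performs only that simple comparison. It ignores address decode and buffer delays, which a real design must subtract from the available time, and the names used are illustrative.

```python
# Speed grades (ns) copied from Table 8-2.
DEVICE_SPEEDS_NS = {
    "Am27C010": (90, 120, 150),
    "Am27C020": (80, 120, 150),
    "Am27C040": (90, 120, 150),
    "Am27C080": (90, 120, 150),
    "Am28F010": (90, 120, 150),
    "Am28F020": (90, 120, 150),
    "Am29F010": (45, 55, 70),
    "Am29F040": (70, 90, 120),
}


def usable_speed_grades(required_ns):
    """Map each device to the speed grades no slower than required_ns."""
    usable = {}
    for device, speeds in DEVICE_SPEEDS_NS.items():
        fast_enough = [speed for speed in speeds if speed <= required_ns]
        if fast_enough:
            usable[device] = fast_enough
    return usable


if __name__ == "__main__":
    # 1-wait requirements for Am2920x ROM space, from Table 8-1.
    for bus_mhz, required in ((16, 105), (20, 79)):
        print(bus_mhz, "MHz:", usable_speed_grades(required))
```

With the 1-wait requirement of 105 ns at 16 MHz every listed device has a usable grade, while the 79 ns requirement at 20 MHz leaves only the Am29F010 and Am29F040 FLASH parts.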

The  fastest  DRAM-only  system,  third  from  the  bottom,  is  an  Am29240  using 
Scalable  Clocking  and  32-bit  DRAM.  This  system  is  130%  faster  than  the  examined 
Am29200  using  ROM-DRAM.  However,  it  is  more  expensive  due  to  the  premium 
speed  microcontroller.  An  alternative  is  to  use  a  less  expensive  Am29200  with 
SRAM.  Shown  on  Figure  8-1  is  a  32-bit  1/1  SRAM  based  system  which  is  100% 
faster  than  the  studied  ROM-DRAM  system.  When  examining  SRAM-only 
systems  (such  as  the  1/1  */*  example),  the  benchmark  program  was  linked  such  that 
both instructions and data were accessed from SRAM. In practice this would likely
require  programs  to  be  located  in  8-bit  ROM,  and  transferred  to  SRAM  during  the 
initialization  stage.  Unfortunately,  SRAM  is  about  eight  times  the  cost  of  DRAM  on 
a  per-byte  basis.  However,  if  only  a  small  amount  of  SRAM  is  required,  the  system 
may  be  cost  effective,  given  the  lower  processor  cost.  At  12.5  MHz,  zero  wait  state 
access  requires  13  ns  SRAM.  Such  devices  are  readily  available. 


Table 8-3. Memory Access Times for Am2924x Microcontroller ROM Space

    Memory Bus    Clock Period   Output Valid     Input Setup   Memory Access Times (ns)
    Speed (MHz)   (CP ns)        (OVR, OVF ns)    (IS ns)       0-Wait   1-Wait   2-Wait

    16            62.5           10, 9            7             15.25    108      170.5
    20            50             10, 9            7             9        83       133
    25            40             10, 9            7             4        63       103
    33            31.25          10, 9            7             -        45.5     76.75

20  MHz  Memory  Systems 

Microcontroller  based  systems  using  20  MHz  memory  systems  are  shown  in 
Figure 8-2. When using DRAM, these systems are always faster than 16 MHz
systems.  However,  a  20  MHz  Am29200  system  using  2/2  ROM  and  3/2  DRAM  has 
only  62%  of  the  performance  of  a  16  MHz  SRAM  system. 

Figure 8-2. 29K Microcontrollers Running the LAPD Benchmark With 20 MHz Memory Systems

Building an Am29200 system with 1/1 SRAM at 20 MHz requires 5 ns memory
access times. These are much more expensive than the 11.25 ns memories required at
16 MHz. To reduce cost, an interleaved SRAM system could be constructed. This
would result in 2/1 SRAM access. However, this achieves only 90% of the
performance of a 1/1 SRAM system operating at 16 MHz. It would be better to
build the slower, less expensive, yet higher performing 16 MHz system.

With  20  MHz  memory  systems,  the  Am2920x  microcontrollers  are  operating  at 
their  maximum  frequency.  As  more  performance  is  required,  the  likelihood  of 
selecting  an  Am29240  processor  increases.  This  is  particularly  true  if  DRAM-only  is 
to  be  used.  An  Am29240  using  32-bit  DRAM-only  (2/1)  is  151%  faster  than  an 
Am29200  using  a  3/2  DRAM-only  system. 

It  is  possible  to  build  SRAM  based  systems  using  an  Am29240  processor. 
Shown  in  Table  8-3  are  the  required  memory  access  times  for  Am2924x  ROM  space 
memory.  The  table  is  based  on  preliminary  AMD  data  which  may  change  in  the 
future.  The  1/1  access  times  are  given  under  the  0-Wait  column.  At  20  MHz  a  9  ns 
access  time  must  be  supported.  This  is  difficult  to  achieve,  and  probably  not 
worthwhile  economically.  In  practice,  it  would  be  better  to  slow  the  clock  speed 
down  to  19.2  MHz  and  use  10  ns  SRAM  devices. 
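
The 19.2 MHz figure follows from the zero-wait equation given earlier for ROM space, solved for the clock period. The Python sketch below shows that arithmetic using the preliminary Am2924x values from Table 8-3 (OVF = 9 ns, IS = 7 ns); the function name is illustrative.

```python
def max_zero_wait_clock_mhz(access_ns, ov_fall_ns, is_ns):
    """Highest memory clock (MHz) at which a device with the given
    access time still meets zero-wait-state timing.

    Rearranges  access = CP/2 - OVF - IS  into  CP = 2 * (access + OVF + IS).
    """
    cp_ns = 2 * (access_ns + ov_fall_ns + is_ns)
    return 1000.0 / cp_ns


if __name__ == "__main__":
    for sram_ns in (9, 10):
        mhz = max_zero_wait_clock_mhz(sram_ns, ov_fall_ns=9, is_ns=7)
        print(f"{sram_ns} ns SRAM: up to {mhz:.2f} MHz")
```

For 10 ns SRAM this gives a clock period of 52 ns, or about 19.2 MHz; 9 ns devices would be needed to reach 20 MHz.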

However,  the  Am29240  system  using  32-bit  2/1  DRAM-only  has  76%  of  the 
performance  of  a  32-bit  1/1  SRAM  system.  The  performance  benefit  of  SRAM, 
relative  to  DRAM,  is  diminished  when  used  with  an  Am2924x  microcontroller.  This 
is  partly  due  to  the  2-cycle  requirement  for  all  data  writes  performed  to  ROM  space. 
The  1/1  access  is  only  achieved  with  instruction  fetching  and  data  reading.  All  data 
writes  are  performed  with,  at  best,  2/2  access  times.  Conversely,  DRAM  supports  2/1 
for  all  types  of  access. 


25  MHz  Memory  Systems 

The  performance  of  25  MHz  memory  systems  is  shown  in  Figure  8-3.  These 
systems  can  only  be  built  using  Am29240  and  Am29243  microcontrollers.  At  this 
speed it is not possible to use 1-cycle first access memory. In addition, 2/1 SRAM has
poorer performance than 2/1 DRAM due to the 2-cycle data-write limitation.

Scalable  Clocking  is  not  available  at  20  MHz  and  above.  Hence,  all  memory 
systems  must  run  at  the  speed  of  the  processor.  The  fast  (2/1)  DRAM  controller 
incorporated  into  the  Am2924x  microcontrollers  makes  DRAM  the  correct  memory 
choice  with  these  processors.  Additionally,  the  2/2  ROM  which  could  be  used  with 
such  systems  would  degrade  performance  from  a  DRAM-only  system.  Hence,  it 
makes  sense  to  use  only  a  slow  8-bit  ROM  to  initialize  the  DRAM.  Program  code, 
and  initialized  data,  should  be  transferred  from  narrow  ROM  to  DRAM  during 
program  initialization.  If  a  program  is  too  large  to  fit  within  a  single  8-bit  ROM,  it 
would  then  make  sense  to  use  16-bit  ROM  for  additional  capacity. 

Figure 8-3. 29K Microcontrollers Running the LAPD Benchmark With 25 MHz Memory Systems

When  executing  from  DRAM  there  is  always  the  danger  of  accidentally  writing 
to  memory  holding  instructions  and  damaging  the  program.  This  can  be  avoided  by 
using  the  on-chip  MMU  to  protect  the  relevant  memory  regions  (see  section  7.4.5). 

33  MHz  Memory  Systems 

The  performance  of  systems  operating  at  33  MHz  is  shown  in  Figure  8-4.  As 
with  25  MHz  systems,  2/1  DRAM-only  memory  is  most  practical.  At  25  and  33  MHz 
the  choice  of  systems  which  are  practical  is  limited  compared  to  the  selection  at  16  or 
even  20  MHz.  In  fact  the  word  “practical”  should  not  be  interpreted  to  mean  easy  or 
readily available. At 33 MHz a DRAM system is challenged to meet the 2/1 timing
specification. Currently only the fastest DRAM devices are usable. For example,
DRAM with 60 ns access times is required by 25 MHz systems. The most practical
way to use 33 MHz processors is with Scalable Clocking which reduces the memory
system speed to 16 MHz. At these higher clock rates, the Am29240 microcontroller
is able to perform as well as many systems built around a 2-bus microprocessor.

Figure 8-4. 29K Microcontrollers Running the LAPD Benchmark With 33 MHz Memory Systems

Further Observations

Memory system requirement is not likely to be the only influence on processor
selection. The Am2924x microcontrollers offer additional on-chip peripherals
compared to the Am2920x processors. This may direct processor selection towards
the more expensive Am2924x grouping. The Am2924x processors also have
additional power saving features. Note, as a means of saving power, the Am2920x
can be temporarily clocked down to 8 MHz, and when necessary the clock returned to
the normal (higher) operating speed.

All  microcontrollers  are  able  to  use  interrupt  context  caching  (see  section  2.5.4). 
This  improves  interrupt  processing  and  is  somewhat  independent  of  the  off-chip 
memory  system  performance.  With  interrupt  context  caching,  the  processor  state  is 
saved and restored from on-chip registers rather than an external memory stack. Hence,
even  the  least  expensive  system  can  support  interrupts  with  a  performance  matching 
that  of  the  more  expensive  systems. 

8.1.2  Moving  up  to  an  Am2920x  Microcontroller 

This  section  presents  the  performance  of  Am2920x  microcontrollers  operating 
at  12.5  and  16  MHz.  The  intention  is  to  evaluate  the  smallest,  least  expensive  systems 
possible.  This  section  should  be  of  interest  to  the  designer  looking  to  use  a  RISC 
microcontroller  to  upgrade  a  system  which  would  have  previously  used  an 
inexpensive  CISC  processor.  The  performance  of  various  8-bit  and  16-bit  memory 
systems  is  shown  on  Figure  8-5. 

Much  of  the  information  presented  in  the  previous  16  MHz  microcontroller 
section  is  applicable  to  the  low  cost  systems  studied  here.  There  are  three  systems  of 
primary  interest.  These  systems  can  be  constructed  at  both  frequencies  of  interest. 
First, systems which operate with 16-bit DRAM-only, including all systems which
have slow ROM, or ROMs which are only 8 bits wide. Having slow or narrow ROM
can help to keep the system cost down. The program must be copied from ROM to
DRAM after processor power up. Hence, the DRAM is the only memory which
influences program execution speeds. Unlike Am2924x microcontrollers, the
Am2920x processors have no Translation Look-Aside Buffers (TLBs).
Consequently, they cannot protect the DRAM from accidental damage during
program execution. This may be more of an issue during code development than in a
final production product.

It  is  important  to  note  here  that  the  Am29205  processor  does  not  have  a 
BOOTW  (boot  width)  pin,  and  hence  must  initially  operate  from  16-bit  wide 
memory.  Only  the  Am29200  processor  operation  can  be  initiated  from  8-bit  ROM. 
Consequently,  DRAM-only  systems  are  more  applicable  to  the  Am29200.  This  is  a 
little  unfortunate  as  only  the  Am29205  is  available  at  the  lower  cost  12.5  MHz 
frequency.  Highlighted  on  Figure  8-5  are  the  simulation  results  for  an  Am29200 
processor  operating  at  16  MHz  using  DRAM-only  (16  R=*/*  D=3/2). 

The  second  type  of  system  of  interest  uses  16-bit  ROM  (2/2)  with  16-bit 
DRAM  (3/2).  This  is  faster  than  operating  from  DRAM-only.  If  ROM  is  to  be  used  it 
must  at  least  support  2/2  access  or  faster.  Additionally,  it  must  be  16-bits  wide.  If  it  is 
slower  or  narrower  it  is  best  to  execute  from  DRAM-only.  A  12.5  MHz  Am29205 
with ROM (2/2) and DRAM (3/2) has 97% of the performance of an
Am29200 operating at 16 MHz with DRAM-only.

The  third  type  of  system  of  interest  uses  1/1  SRAM.  Given  the  higher  cost  of 
SRAM  compared  to  DRAM,  this  configuration  is  only  applicable  when  extra 
performance  is  required.  The  SRAM-only  systems  shown  in  Figure  8-5  would 
require  an  8-bit  ROM  for  program  initialization  —  much  the  same  as  DRAM-only 
systems.  The  simulation  results  show  that  a  16-bit  DRAM-only  system  has  only 
79%  of  the  performance  of  an  8-bit  1/1  SRAM  system.  The  8-bit  SRAM  system  has 
2% more performance than the 16-bit 2/2-3/2 system (ROM-DRAM). The reason
for  the  higher  performance  can  be  understood  by  examining  the  number  of  cycles 
required  to  fetch  a  single  32-bit  instruction.  With  a  16-bit  3/2  DRAM-only  system, 
6-cycles  are  required  to  fetch  the  first  instruction;  4-cycles  for  burst-mode  fetched 
instructions.  With  8-bit  1/1  SRAM,  4-cycles  are  required  to  fetch  instructions.  The 
8-bit  SRAM  has  the  advantage. 
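
The cycle counts quoted above can be reproduced with a simple model. The Python sketch below assumes that a 32-bit instruction is assembled from 32/width sequential accesses, and that the 6-cycle DRAM figure includes the 1-cycle RAS precharge which is not hidden in DRAM-only systems. Both are readings of the text rather than simulator internals, and the function name is illustrative.

```python
def fetch_cycles(width_bits, first, subsequent, precharge=0, burst=False):
    """Cycles needed to fetch one 32-bit instruction.

    width_bits  memory width (8, 16 or 32)
    first       cycles for the initial access (the "3" in 3/2)
    subsequent  cycles for each following access (the "2" in 3/2)
    precharge   extra cycles paid on the initial access (assumed 1 for
                DRAM-only systems, 0 for SRAM)
    burst       True when the instruction continues an existing burst,
                so every access is charged at the subsequent rate
    """
    accesses = 32 // width_bits
    if burst:
        return accesses * subsequent
    return precharge + first + (accesses - 1) * subsequent


if __name__ == "__main__":
    print("16-bit 3/2 DRAM, first:", fetch_cycles(16, 3, 2, precharge=1))
    print("16-bit 3/2 DRAM, burst:", fetch_cycles(16, 3, 2, precharge=1, burst=True))
    print("8-bit 1/1 SRAM:        ", fetch_cycles(8, 1, 1))
```

Under these assumptions the 16-bit DRAM needs 6 cycles for a first instruction and 4 cycles within a burst, while the 8-bit SRAM needs 4 cycles in either case, which is why the narrower SRAM comes out ahead.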

Building  a  16-bit  SRAM  system  which  is  1/1  produces  a  system  which  has 
140%  of  the  performance  of  a  16-bit  DRAM-only  system.  At  12.5  MHz, 
single-cycle  access  requires  13  ns  SRAM,  which  is  readily  available.  Simple  SRAM 
based  designs  can  offer  surprisingly  good  performance  but  the  small  size  of  SRAM 
devices  results  in  the  systems  only  being  suitable  for  applications  requiring  small 
amounts  of  memory.  Otherwise  the  cost  of  the  SRAM  is  likely  to  be  prohibitively 
high. 

8.1.3 Selecting a Microprocessor

The  highest  performance  systems  are  constructed  around  2-bus  29K 
processors.  The  following  subsections  present  the  performances  obtained  for  a 
complete  range  of  29K  processors  running  the  LAPD  benchmark.  Systems  are 
studied  and  grouped  in  subsections  according  to  their  memory  system  speed.  All 
values  were  obtained  using  32-bit  DRAM  or  SRAM  memory  systems.  Processors 
ran,  as  indicated,  at  the  same  speed  as  the  system  memory  or  at  two-times  using 
Scalable  Clocking  technology.  Comparing  the  fastest  and  slowest  systems,  there  is  a 
performance difference of more than 6-to-1. There is also a range of in-between
systems  which  offer  a  wide  selection  of  performance  configurations. 

DRAM is frequently referred to as, say, 2/1. This assumes the often required
1-cycle  of  precharge  (RAS  precharge)  is  hidden.  When  DRAM-only  systems  are 
constructed,  the  precharge  encountered  when  accessing  a  new  memory  page  can  not 
always  be  hidden,  and  the  access  is  thus  3/1  rather  than  2/1.  This  is  explained  in 
section  1.14.1  under  the  Am29200  and  Am29205  subheading.  The  previous  section 
on  selecting  a  microcontroller  also  referred  to  DRAM  memory  speeds  without 
including  the  necessary  precharge  time.  The  terminology  is  acceptable  because  the 
precharge  can  frequently  be  hidden  when  the  ROM  region  is  used  in  conjunction  with 
the  DRAM  region.  Consequently  precharge  has  little  effect  on  performance. 
However,  when  DRAM-only  systems  are  constructed,  the  effect  precharge  has  on 
system performance is more noticeable. Even if a 2/1 DRAM-only system suffers a
1-cycle precharge on all new page accesses, thus resulting in a 3/1 access, it shall still
be termed 2/1. Consistently maintaining the same notation for memory access
throughout this book helps with system comparisons. In summary, a 2/1 DRAM has a
2-cycle initial access followed by 1-cycle for same-page accesses. With
DRAM-only systems a 2/1 system equates to a 3/1 system for all new page accesses.

With  microcontrollers,  the  access  times  for  the  DRAM  memory  region 
controller  are  built  into  the  Architectural  Simulator.  Constructing  an  event  file  for 
2-bus  processors  is  a  little  more  difficult.  The  event  file  shown  below  describes  a  2/1 
DRAM  system  used  with  a  2-bus  processor.  The  required  precharge  and  refresh 
times  are  included.  These  parameters  are  also  built  into  the  simulation  model  for 
microcontrollers.  If  a  2/1  SRAM  system  was  being  modeled,  the  precharge  and 
refresh  parameters  would  be  omitted.  Note,  in  the  example,  Scalable  Clocking  is  not 
selected. 


    ; Architectural Simulator event file
    spacerambank   80000000 100000   ; memory location
    ramwidth       32                ; 32-bit DRAM
    ramread        2                 ; 2/1 access
    ramwrite       2
    ramburst       true              ; burst enable
    rambread       1
    rambwrite      1
    halfspeedbus   false             ; no Scalable Clocking
    ramprecharge   1                 ; DRAM precharge
    rampprecharge  2
    ramrefresh     255               ; DRAM refresh

The  LAPD  program  was  linked  and  then  simulated  operating  from  DRAM-  or 
SRAM-only  memory.  Systems  using  mixed  DRAM  and  SRAM  were  not  modeled. 
It  is  certainly  possible  to  use  SRAM;  although  the  cost  is  likely  to  be  relatively  high, 
given  the  generally  large  memory  systems  attached  to  2-bus  processors.  However, 
ROM  devices  are  likely  to  be  used  with  2-bus  processors;  but  since  it  is  difficult  to 
achieve  1-cycle  access  with  ROM,  it  is  likely  that  data  and  code  held  in  ROM  would 
be  transferred  to  DRAM  for  faster  access.  Alternatively,  an  interleaved  ROM  system 
may be constructed. This would enable 2/2 ROM to achieve 2/1 memory access,
resulting in much the same performance as 2/1 DRAM. In fact it would be faster, as
there is no need to perform precharge when changing access to a new memory page.

The  Am29050  3-bus  processor  results  have  been  included  with  the  2-bus 
results.  The  Am29050  is  the  only  member  of  the  3-bus  processor  group  which  is 
likely  to  be  selected  for  a  new  design.  A  2-bus  processor  such  as  the  Am29030  offers 
as  much  performance  as  a  3-bus  Am29000;  and  also  offers  a  simpler  system  design, 
as  well  as  an  easier  upgrade  path.  The  inclusion  of  a  floating-point  execution  unit 
within  the  Am29050  makes  it  an  attractive  choice  for  floating-point  intensive 
operations.  Two-bus  processors  perform  floating-point  operations  using  emulation 
routines  generally  accessed  via  traps.  This  is  slower  than  the  direct  execution 
achieved with the Am29050. However, the Am29040 can perform integer multiply
directly in hardware (other 2-bus processors use emulation); this assists the task of
emulating missing floating-point operations. Consequently, the Am29040 processor
is  an  alternative  to  the  Am29050  when  an  application  requires  fast  floating-point 
support.  The  LAPD  benchmark  does  not  contain  any  floating-point  operations. 
Hence,  it  is  not  a  suitable  benchmark  for  evaluating  processors  for  floating-point 
operation. 

16  MHz  Memory  Systems 

This  is  the  entry  level  system  speed  for  2-bus  processors.  The  Am29035  is  the 
only  processor  offered  at  this  speed.  However,  using  Scalable  Clocking  an  Am29030 
or  Am29040  operating  at  33  MHz  internally,  can  be  combined  with  a  16  MHz 
memory  system.  When  Scalable  Clocking  is  used,  off-chip  instruction  and  data 
access  is  performed  at  the  slower  16  MHz  memory  system  speed.  The  higher  cost  of 
the faster processors makes these systems more expensive, but a considerably more
powerful system is achieved. Figure 8-6 shows performance results for various
systems  based  on  16  MHz  memory  systems. 

The performance of the systems improves in accordance with the availability of
faster memory. Given that all 2-bus 29K processors contain instruction cache,
off-chip access is mainly required for instruction cache reload. Caches which
support only one valid-bit per block always fetch a complete block before marking
the block valid. The Am29040, which has a valid-bit for each entry in the block,
fetches only the instructions required for execution. Table 8-4 gives the cache reload
time for various memory systems, assuming a complete block is fetched. DRAM
access times include precharge time.

Relative  system  performance  is  ordered  in  accordance  with  the  cache  block 
reload  times  shown  in  Table  8-4.  Using  an  Am29035,  the  slowest  3/2  DRAM  system 
has  70%  of  the  performance  of  the  fastest  2/1  DRAM  system.  However,  Scalable 
Clocking enables the Am29035 to be replaced with an Am29030 which improves
the  performance  of  the  2/1  system  by  93%.  Selecting  an  Am29040  improves  the 
performance  by  192%.  The  33  MHz  processors  cost  more  but  they  enable  three 
price-performance  combinations  by  merely  replacing  the  pin-compatible 
processors  and  doubling  the  clock  frequency. 

At  16  MHz  the  62.5  ns  clock  period  makes  a  2/1  DRAM  design  feasible  using 
readily  available  80  ns  DRAMs.  During  a  page  mode  access  RAS  is  held  active,  and 
the  access  time  after  asserting  the  CAS  is  typically  30  ns.  Assuming  CAS  precharge 
times  of  15  ns  (the  interval  CAS  is  held  high  between  consecutive  accesses),  the  time 
required  to  complete  an  access  within  the  page  is  45  ns.  In  practice  it  is  likely  to  be  a 
little more due to support circuitry such as PAL devices. This adds an additional 5 to
10  ns  to  the  access  time.  Hence,  the  total  access  time  for  the  current  page  is  about  55 
ns,  under  the  62.5  ns  necessary  for  1-cycle  burst  access.  Allowing  2-cycle  first 
access  provides  125  ns,  which  exceeds  the  typical  80  ns  access  time  for  DRAM.  Of 
course,  if  precharge  is  required  the  RAS  must  be  held  high  for  an  additional  60  ns 
(typically)  before  a  new  page  address  is  established  and  RAS  brought  low.  More 
sophisticated  systems  can  bring  the  RAS  signal  high  while  CAS  is  still  asserted.  This 
enables  RAS  precharge  to  start  early  and  can  help  eliminate  all  or  some  of  the  RAS 
precharge  time.  The  results  for  a  2/1  DRAM-only  system  are  highlighted  in 
Figure  8-6. 
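The timing budget above can be checked with a short calculation. The following Python sketch is only an illustration of the arithmetic in the text; the function names are hypothetical and the 10 ns support-logic delay is the upper end of the 5 to 10 ns range quoted above.

```python
# Page-mode DRAM timing budget at 16 MHz, using the figures quoted in the text.
# All names are illustrative; they do not come from any AMD tool or library.

def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / freq_mhz

def in_page_access_ns(cas_access_ns, cas_precharge_ns, support_logic_ns):
    """Time to complete one access within the currently open DRAM page."""
    return cas_access_ns + cas_precharge_ns + support_logic_ns

period = clock_period_ns(16)            # 62.5 ns per cycle
burst = in_page_access_ns(30, 15, 10)   # 30 ns CAS access + 15 ns CAS precharge + 10 ns PAL

print(f"clock period        : {period} ns")
print(f"in-page access time : {burst} ns")
print("1-cycle burst feasible      :", burst <= period)
print("2-cycle first access (80 ns):", 80 <= 2 * period)
```

Both checks succeed with these figures, which is why the text treats a 2/1 DRAM design as feasible at 16 MHz.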


Table 8-4. Cache Block Reload Times for Various Memory Types

Memory Device    Access Time (cycles)    Time to reload a cache block (cycles)
DRAM             3/2                     (3+1)+2+2+2 = 10
DRAM             2/2                     (2+1)+2+2+2 = 9
SRAM             3/2                     (3)+2+2+2 = 9
DRAM             3/1                     (3+1)+1+1+1 = 7
DRAM             2/1                     (2+1)+1+1+1 = 6
SRAM             2/1                     (2)+1+1+1 = 5

The instruction cache hit-ratio is 47.7% for the Am29035 using a 2/1 DRAM
system. The Am29035 has the smallest instruction cache, 4K bytes. The Am29030
and Am29040 have an instruction cache hit-ratio of 95.1%. This is due to their
larger 8K byte cache.
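The reload times in Table 8-4 follow a simple pattern: a first access, one extra precharge cycle for DRAM, and three further burst accesses. The Python sketch below reproduces the table under that reading; the four-word block and the single DRAM precharge cycle are inferred from the table's own arithmetic, and the function name is hypothetical.

```python
# Recompute the cache block reload times of Table 8-4.
# Assumptions (inferred from the table): a block is 4 words, and DRAM adds
# one precharge cycle to the first access while SRAM adds none.

def block_reload_cycles(first, burst, precharge=0, words=4):
    """Cycles to reload one cache block: first access plus burst accesses."""
    return first + precharge + burst * (words - 1)

# (device, first-access cycles, burst cycles, precharge cycles)
memory_systems = [
    ("DRAM 3/2", 3, 2, 1),
    ("DRAM 2/2", 2, 2, 1),
    ("SRAM 3/2", 3, 2, 0),
    ("DRAM 3/1", 3, 1, 1),
    ("DRAM 2/1", 2, 1, 1),
    ("SRAM 2/1", 2, 1, 0),
]

for name, first, burst, precharge in memory_systems:
    cycles = block_reload_cycles(first, burst, precharge)
    print(f"{name}: {cycles} cycles")
```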

20  MHz  Memory  Systems 

The performance of processors operating with 20 MHz memory systems is
shown in Figure 8-7. The relatively inexpensive Am29035 processor is not available
at this or higher frequencies. With the Am29030 processor, the 3/2 SRAM system is
faster than the 3/1 DRAM. This is due to the higher cache hit-ratio of the Am29030
compared to the Am29035. The larger cache reduces the impact cache reload has on
performance. However, it reveals the cost of data memory access, which occurs
frequently with the LAPD benchmark. With DRAM, accessing a single 32-bit data
object costs 4 cycles (1-cycle precharge plus 3-cycle access). The SRAM only
requires 3 cycles for the same task. Consequently, the SRAM, although it cannot
sustain 1-cycle burst, is faster. Using the Am29040 processor, the 3/1 DRAM is
again a better choice than 3/2 SRAM. This is a result of the on-chip data cache
reducing the effects of memory precharge for data accesses.

The 50 ns period of a 20 MHz system clock makes 1-cycle burst-mode access
possible using slightly faster DRAM. Additionally, 2-cycle first access (100 ns) is
possible with 80 ns DRAM. The 60 ns RAS precharge can be hidden if DRAM is
combined with ROM, with the precharge occurring during ROM access. Alternatively,
DRAM access, including RAS precharge, takes 140 ns (80+60), which is effectively
3 cycles (150 ns) for an initial new-page access.
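The cycle counts quoted above are simply access times rounded up to a whole number of clock periods. A minimal Python sketch of that rounding, with a hypothetical helper name, is:

```python
import math

# Round a memory access time up to whole clock cycles.
# The helper name is illustrative; the figures are those quoted in the text.

def cycles_needed(access_ns, freq_mhz):
    """Smallest number of clock cycles that covers access_ns at freq_mhz."""
    period_ns = 1000.0 / freq_mhz
    return math.ceil(access_ns / period_ns)

first_access = cycles_needed(80, 20)          # 80 ns DRAM, page already open
new_page_access = cycles_needed(80 + 60, 20)  # plus 60 ns RAS precharge

print("first access, no precharge :", first_access, "cycles")
print("first access with precharge:", new_page_access, "cycles")
```

At 20 MHz this gives 2 cycles without precharge and 3 cycles when the RAS precharge cannot be hidden, matching the figures in the text.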

The results for 2/1 and 2/2 DRAM-only systems are highlighted in Figure 8-7.
Using an Am29030 processor, a 2/1 system has 15% more performance than the 2/2
system. The difference is 24% when using an Am29040 processor. Scalable
Clocking enables the Am29030 to be replaced with an Am29040, which improves the
performance of the 2/2 system by 96%. With a 2/1 system, the performance is
improved by 112%.

25  MHz  Memory  Systems 

The performance of processors operating with 25 MHz memory systems is
shown in Figure 8-8. At 25 MHz the cycle time is reduced to 40 ns. This makes
1-cycle burst-mode access difficult to achieve. Let's look at the arithmetic: fast
DRAMs have an access time of, say, 20 ns from CAS assertion, and 10 ns CAS
precharge. Let's also assume fast 5 ns PAL logic, and a best-case input setup time of 6
ns (Am29040; 12 ns for the Am29030). This results in a total access time of 41 ns (47
ns for the Am29030). This makes 1-cycle burst-mode impossible at 25 MHz
without an interleaved memory system. However, an Am29040 could be operated at
24 MHz and just achieve the timing requirements for 1-cycle burst-mode access. An
initial 2-cycle access time cannot be achieved with 80 ns DRAM. This would require

Figure 8-7. 29K Microprocessors Running the LAPD Benchmark

The LAPD benchmark results indicate that a 2/1 DRAM-only system is 27%
faster than a 3/2 DRAM-only system. The interleaved 3/1 DRAM-only system is
14% faster than the non-interleaved 3/2 DRAM-only system, and 23% faster for the
clock-scaled systems based on the Am29040. The results for 3/1 and 3/2
DRAM-only systems are highlighted in Figure 8-8. Scalable Clocking enables the
Am29030 to be replaced with an Am29040, which improves the performance of the
3/2 system by 110%. With a 3/1 system, the performance is improved by 128%.

Note also that a system based on an Am29030 using 3/2 DRAM-only at 25
MHz has 2% lower performance than an Am29030 using 2/1 DRAM-only at 20
MHz. Consequently, in this case, there is a performance advantage obtainable by
reducing the memory system operating frequency.
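The 1-cycle burst-mode arithmetic given earlier in this section for 25 MHz systems can be reproduced with a short sketch. The Python below sums the four delays quoted in the text and converts the total into the highest clock frequency at which one access fits in one cycle; the function names are hypothetical.

```python
# Burst-mode timing budget for the 25 MHz discussion.
# Delay figures are those quoted in the text; names are illustrative only.

def burst_budget_ns(cas_access, cas_precharge, pal_delay, input_setup):
    """Total time for one in-page access, including processor input setup."""
    return cas_access + cas_precharge + pal_delay + input_setup

def max_one_cycle_burst_mhz(total_ns):
    """Highest clock frequency (MHz) whose period still covers total_ns."""
    return 1000.0 / total_ns

for name, setup_ns in (("Am29040", 6), ("Am29030", 12)):
    total = burst_budget_ns(20, 10, 5, setup_ns)
    limit = max_one_cycle_burst_mhz(total)
    print(f"{name}: {total} ns per access, 1-cycle burst up to about {limit:.1f} MHz")
```

For the Am29040 the limit falls between 24 and 25 MHz, which is consistent with the observation that it just meets the timing at 24 MHz but not at 25 MHz.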

33  MHz  Memory  Systems 

The  performance  of  processors  operating  with  33  MHz  memory  systems  is 
shown  in  Figure  8-9.  Scalable  Clocking  is  no  longer  available  with  33  MHz  memory 
systems.  This  places  33  MHz  systems  in  competition  with  slower  memory  systems 
using  clock  scaled  processors.  It  is  not  likely  that  construction  of  a  memory  system 
faster  than  33  MHz  would  be  practical.  Clock  rates  such  as  40  MHz  would  require 
additional  memory  wait  states.  Consequently,  the  system  performance  would  be  no 
faster  than  a  slower  memory  system  operating  with  fewer  memory  access  wait  states. 

Only the very fastest DRAMs and support circuitry could achieve 1-cycle
burst-mode access without using an interleaved DRAM design. Achieving 2-cycle
initial access is also unlikely given the 31.25 ns clock period. A 3-cycle (93.75 ns)
first access seems more likely. The results for 3/1 and 3/2 DRAM-only systems are
highlighted in Figure 8-9.

The complexity of building a 3/1 rather than a 3/2 memory system at 33 MHz
must be justified by a sufficient performance gain. Using an Am29030 processor, a
performance gain of 13.6% is shown. With an Am29040, the performance gain is
17.2%. However, using Scalable Clocking, an Am29040 (50 MHz) using 3/1 DRAM
at 25 MHz is 41% faster than an Am29040 using 3/2 DRAM at 33 MHz, and 20%
faster than an Am29040 using faster 3/1 DRAM at 33 MHz. An Am29040 operating
at 50 MHz with a 2/1 DRAM system operating at 25 MHz is the fastest DRAM-only
system benchmarked.

Additional Observations

Setting the Am29050 aside due to its enhanced floating-point processing
capability, system performance is likely to be the main influence on processor
selection. However, if low power consumption, 3.3-volt operation, or
multi-processor cache consistency are important issues, then the Am29040
processor is the clear choice. Scalable Clocking technology clearly offers a major
performance enhancement, but it does require the purchase of a more expensive
processor. This may be offset by a simpler memory system design which uses less
expensive DRAM components. Scalable Clocking and processor pin-compatibility
do simplify the task of creating multiple price-performance system configurations
using a single board design.

Figure 8-9. 29K Microprocessors Running the LAPD Benchmark
with 33 MHz Memory Systems

All 29K processors are able to use interrupt context caching (see section 2.5.4).
This improves interrupt processing and is somewhat independent of the off-chip
memory system performance. As with microcontrollers, even the least expensive
2-bus microprocessor can support interrupts with a performance matching that of the
more expensive systems.

The  LAPD  benchmark  does  not  evaluate  all  of  the  performance  parameters 
which  are  important  to  a  real-time  system.  Performance  parameters  such  as  task 
context  switch  time  may  also  be  important.  Many  non-application  benchmarks  are 
concerned  with  movement  of  large  amounts  of  data  —  such  as  context  saving  and 
restoring.  Achieving  single-cycle  burst-mode  for  data  access  will  greatly  improve 
data  move  operations  which  use  store-  and  load-multiple  operations. 

8.1.4  Reducing  the  Register  Window  Size 

As explained in section 2.3.2, the “lregs=n” switch can be used with the High C
29K compiler to reduce the size of the register stack cache. That is, not all of the 128
local registers need be allocated to caching the top of the run-time register stack.

The register cache is often referred to as the register window (window size =
rfb - rab, see section 2.1.1). A reduced window size would increase the frequency of
stack spilling and filling, and consequently reduce the processor's effectiveness at
executing application code. However, in certain circumstances, task context switch
times are reduced when operating with a smaller window size.

First,  let’s  look  at  the  loss  of  performance  resulting  from  adjusting  the  window 
size.  The  maximum  window  size  is  128,  and  the  minimum  is  36  (a  restriction  set  by 
the  “lregs”  compiler  option).  Figure  8-10  shows  performance  results  for  an 
Am29040  microprocessor  operating  at  50  MHz  with  a  25  MHz  2/1  DRAM-only 
memory  system.  With  the  window  reduced  to  64,  which  is  half  its  normal  size  of  128 
registers,  a  performance  reduction  of  9.2%  was  measured.  With  a  window  size  of 
only  48,  the  performance  is  reduced  by  15%. 

The same LAPD performance measurements were acquired with an Am29200
microcontroller operating with 3/2 DRAM. The results are shown in Figure 8-11.
The performance drop-off with reduced window size is less pronounced. With a
window size of 64, the reduction is only 5.2%, and 10% with a window size of 48.

The load- and store-multiple operations used by spill and fill handlers are
performed directly to memory. On-chip data cache has no effect on spill/fill data
movement (at least with these two processors). Assuming an adequate cache hit ratio,
the Am29040 processor frequently executes instructions at the processor speed of 50
MHz. However, the load and store data transfers which occur during spilling and
filling are performed at the memory system speed of 25 MHz. Hence, as spilling and
filling become more frequent, the effective operating frequency is reduced. This
explains why the Am29040 performance, using 2x scalable clocking, suffers a more
marked reduction than observed with the Am29200 microcontroller.

Figure 8-11. Am29200 Microcontroller Running the LAPD Benchmark
with Various Register Stack Window Sizes
(Memory is 3/2 DRAM operating at 20 MHz)


Figure 8-12. Am29040 Microprocessors Running the Stanford Benchmark
with Various Register Stack Window Sizes
(Am29040 operating at 50 MHz; memory is 2/1 DRAM operating at 25 MHz.
Execution time in seconds for the Perm, Towers, Queens, Intmm, Quick, and
Bubble routines.)

Some applications may be more, or less, affected by changes in the window size;
it very much depends on the procedures' register requirements and on the level of
procedure nesting. To study these effects further, the Am29040-based system was
used to run the Stanford benchmark. Separate results for six of the routines
taken from the integer part of the Stanford code are shown in Figure 8-12. The
routines were chosen because of their diversity in function and similarity in
execution times. This similarity made for clearer scaling and hence easier
comparison of the results.

Routines which have a small register requirement are unaffected by the
reduction in window size. The Towers and Queens tests rely on recursive
procedure calls. Consequently, these routines show a marked loss of performance
when operating with reduced window sizes. Even so, the reduction in performance
is less than one might expect. Applications which have a small dynamic
register-stack requirement experience only a moderate loss of performance when
operating with a reduced window size. However, future 29K processors which use
3x or 4x scalable clocking technology and superscalar execution are likely to show
a relatively greater loss of performance with reduced window sizes. At higher
execution speeds, the cost of going off-chip is increased.

We have looked at the loss of performance associated with reduced window
sizes but what, if any, are the benefits? It was already stated that task context
switching can be improved. This is true, but needs further explanation. Most
operating system manufacturers provide basic context switch times for benchmarks
run on their product. These benchmarks typically indicate a raw context switch time
of 10 to 20 microseconds. Longer or shorter times are possible depending on the
implementation and the speed of the system memory. Benchmark programs usually
measure synchronous context switch times; these are shorter than asynchronous
switch times. When a synchronously saved context is switched in, only the current
activation record need be restored in the register cache (typically 12 registers). With
an asynchronously saved context, the register cache must be restored to the position
at which the context was saved (several activation records). Hence, asynchronous
switches take longer than synchronous switches. How much longer depends on a
number of factors.

An operating system may be written in C, and the context switch code may
occur at a depth of several levels of procedure nesting. As well as these operating
system related activation records, the register cache will contain activation records
relating to the application task. It is not possible to state, in general, just how much of
the register cache is in use at the point the context switch occurs; but certainly the
worst-case condition is known. The maximum number of local registers which
would require saving or restoring is limited to the window size. Consequently,
reducing the window size reduces the worst-case context switch time. It will have
less of an effect on average context switch times; and little, if any, on synchronous
context switch times.

Task context switching is not a frequently occurring activity compared to
procedure calls and returns. Reducing the window size may improve context switch
times, but at the cost of increasing the overhead associated with a more frequently
occurring event: spilling and filling. However, for those systems which perform an
unusually high number of task context switches, say 20,000 per second (one every 50
microseconds), it may make sense to operate with a reduced window size.
Figure 8-13 shows the amount of time removed from the worst-case (save and
restore) task context switch times for various window sizes. Two memory systems
are presented: a 1-cycle memory at 25 MHz, and a 2-cycle memory at 16 MHz. As
the window size is reduced, the savings increase.
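A rough feel for the size of these savings can be obtained with a back-of-envelope model. The Python sketch below is an assumption-laden estimate, not a reproduction of Figure 8-13: it assumes every register removed from the window avoids one store and one load in the worst case, and that each transfer costs a fixed number of memory cycles. The function name and the model itself are hypothetical.

```python
# Estimated reduction in worst-case (save and restore) context switch time.
# ASSUMPTION: each register dropped from the window saves one store plus one
# load, and every transfer takes cycles_per_word memory cycles. This is a
# simplified model for illustration, not data taken from Figure 8-13.

FULL_WINDOW = 128  # maximum window size stated in the text

def worst_case_saving_us(window_size, cycles_per_word, memory_mhz):
    """Microseconds removed from the worst-case save-and-restore time."""
    registers_avoided = FULL_WINDOW - window_size
    transfers = 2 * registers_avoided          # one save and one restore each
    return transfers * cycles_per_word / memory_mhz

for window in (128, 96, 64, 48):
    fast = worst_case_saving_us(window, 1, 25)  # 1-cycle memory at 25 MHz
    slow = worst_case_saving_us(window, 2, 16)  # 2-cycle memory at 16 MHz
    print(f"window {window:3d}: {fast:6.2f} us (25 MHz), {slow:6.2f} us (16 MHz)")
```

Under this model the saving grows linearly as the window shrinks, and the slower 2-cycle memory benefits more per register removed, which matches the qualitative trend the text describes.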


Figure 8-13. Reduction In Worst-Case Asynchronous Task Context Switch Times
with Various Register Stack Window Sizes
(Horizontal axis: Window Size, 0 to 128 registers)

Reducing the window size can also be a benefit during interrupt processing. When a
User mode program is interrupted, it is normal to save the interrupted program’s
context in the Process Control Block (PCB) (see section 5.4). However, if no task
context switch will occur as a result of the interrupt, then it is faster to save the context
in the available register cache space.

Similarly, when a Supervisor mode program is interrupted, context is saved on a
memory stack. This is the approach taken by the non-multitasking OS-boot
operating system supplied by AMD. Appendix B presents the interrupt context save
code which can be used by a HIF-conforming operating system to prepare for signal
processing. The example code saves context at the current memory stack position. A
total of 51 32-bit words are required to save the complete context. If the cache
window were reduced to 128-52=76 registers, there would be sufficient space to
save the interrupted context in the register file. The registers in the space (rfb-128) to
rab are not part of the window. In fact, registers below rsp (rsp to rfb-128) are always
available for interrupt handler use.

Appendix A


HIF  Service  Calls 


A.1  Service  Call  Numbers  And  Parameters 

This section describes in detail each HIF 2.0 service. Service calls use local
registers to pass parameters to the operating system; global registers are used to
return results. Example code sequences are given for each service. However, user
code does not normally invoke HIF services directly. It is more likely that an
assembly language glue routine or other library routine will be used to access the
service. AMD supplies libraries of the necessary glue code. Chapter 2 introduced the
HIF concepts and has an overview of its services.

Service 1 - exit


Terminate  a  Program 


Description 

This service terminates the current program and returns a value to the system
kernel, indicating the reason for termination. By convention, a zero passed in lr2
indicates normal termination, while any non-zero value indicates an abnormal
termination condition. There are no returned values in registers gr96 and gr121 since
this service does not return.


Register Usage

Type        Regs     Contents     Description
Calling:    gr121    1 (0x1)      Service number
            lr2      exitcode     User-supplied exit code
Returns:    gr96     undefined    This service call does not return
            gr121    undefined    This service call does not return

Example Call

        const   lr2,1           ;exit code = 1
        const   gr121,1         ;service = 1
        asneq   69,gr1,gr1      ;call the operating system

In the above example, the operating system kernel is being called with service
code 1 and an exit code of 1, which is interpreted according to the specifications of the
individual operating system. The value of the exit code is not defined as part of the
HIF specification.

In general, however, an exit code of zero (0) specifies a normal program
termination condition, while a non-zero code specifies an abnormal termination
resulting from detection of an error condition within the program.

Programs can terminate normally by falling through the curly brace at the end of
the main function in a C language program. Other languages may require an explicit
call to the kernel’s exit service.

Service 17 - open

Open a File

Description 

This service opens a named file in a requested mode. Files must be explicitly
opened before any read, write, close, or other file positioning accesses can be
accomplished. The open service, if successful, returns an integer token that is used to
refer to the file in all subsequent service requests. In many high-level languages the
returned token is referred to as a file descriptor. File names are generally not portable
from one implementation to another. In some cases, names can be made more
portable by limiting them to six or fewer upper-case alphabetic characters, or by using
the tmpnam HIF service (33) to create names that conform to the current
implementation’s file system requirements.

Environment variables can also be used to specify legal file names for
application programs wishing to conform to the requirements of a particular HIF
implementation. The getenv service (65) provides the means to associate a file name or
pathname with a mnemonic reference. This is the most portable means to specify
pathnames for implementations that incorporate the getenv service.

The HIF specification intentionally refrains from defining the constituents of a
legal pathname, or any intrinsic characteristics of the implemented file system. In
this regard, the only requirement of a HIF-conforming kernel is that when the open
service is successfully performed, the routine returns a small integer value that
can be used in subsequent input/output service calls to refer to the opened file.


Register Usage

Type        Regs     Contents      Description
Calling:    gr121    17 (0x11)     Service number
            lr2      pathname      A pointer to a filename
            lr3      mode          See parameter descriptions below.
            lr4      pflag         See parameter descriptions below.
Returns:    gr96     fileno        Success: > 0 (file descriptor)
                                   Failure: < 0
            gr121    0x80000000    Logical TRUE, service successful
                     errcode       Error number, service not successful
                                   (implementation dependent)

Parameter  Descriptions 

Pathname is a pointer to a zero-terminated string that contains the full path and
name of the file being opened. Individual operating systems have different means to
specify this information. With hierarchical file systems, individual directory levels
are separated with special characters that cannot be part of a valid filename or
directory name. In UNIX-compatible file systems, directory names are separated by
forward slash characters, / (e.g., /usr/jack/files/myfile), where usr, jack, and files are
succeedingly lower directory levels, beginning at the root directory of the file
system. The name myfile is the filename to be opened at the specified level. The
individual characteristics of files and pathnames are determined by the specifications
of a particular operating system implementation.

The mode parameter is composed of a set of flags, whose mnemonics and
associated values are listed in Table A-1.

Table A-1. HIF Open Service Mode Parameters

Name        Value     Description
O_RDONLY    0x0000    Open for read only access
O_WRONLY    0x0001    Open for write only access
O_RDWR      0x0002    Open for read and write access
O_APPEND    0x0008    Always append when writing
O_NDELAY    0x0010    No delay
O_CREAT     0x0200    Create file if it does not exist
O_TRUNC     0x0400    If the file exists, truncate it to zero length
O_EXCL      0x0800    Fail if writing and the file exists
O_FORM      0x4000    Open in text format
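Because the mode parameter is a set of flags, individual values from Table A-1 are combined with a bitwise OR. The Python sketch below illustrates this using the table's values; the decoding helper and the two-bit access mask are assumptions made for illustration and are not part of the HIF specification.

```python
# Flag values from Table A-1.
O_RDONLY = 0x0000
O_WRONLY = 0x0001
O_RDWR   = 0x0002
O_APPEND = 0x0008
O_NDELAY = 0x0010
O_CREAT  = 0x0200
O_TRUNC  = 0x0400
O_EXCL   = 0x0800
O_FORM   = 0x4000

# ASSUMPTION: the low two bits select the access type. O_RDONLY is zero, so
# it cannot be detected by testing a single bit and must be decoded by mask.
ACCESS_MASK = 0x0003
ACCESS_NAMES = {O_RDONLY: "O_RDONLY", O_WRONLY: "O_WRONLY", O_RDWR: "O_RDWR"}
OPTION_FLAGS = [
    ("O_APPEND", O_APPEND), ("O_NDELAY", O_NDELAY), ("O_CREAT", O_CREAT),
    ("O_TRUNC", O_TRUNC), ("O_EXCL", O_EXCL), ("O_FORM", O_FORM),
]

def describe_mode(mode):
    """Return the list of flag names that make up a mode value."""
    names = [ACCESS_NAMES.get(mode & ACCESS_MASK, "invalid access type")]
    names += [name for name, bit in OPTION_FLAGS if mode & bit]
    return names

# Read/write, create if missing, text format.
mode = O_RDWR | O_CREAT | O_FORM
print(hex(mode), describe_mode(mode))
```

The combined value fits in 16 bits, so on the 29K it can be loaded into lr3 with a single const instruction.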

The O_RDONLY mode provides the means to open a file and guarantee that
subsequent accesses to that file will be limited to read operations. The operating
system implementation will determine how errors are reported for unauthorized
operations. The file pointer is positioned at the beginning of the file, unless the
O_APPEND mode is also selected.

The O_WRONLY mode provides the means to open a file and guarantee that
subsequent accesses to that file will be limited to write operations. The operating
system implementation will determine how errors are reported for unauthorized
operations. The file pointer is positioned at the beginning of the file, unless the
O_APPEND mode is also selected.

The O_RDWR mode provides the means to open a file for subsequent read and
write accesses. The file pointer is positioned at the beginning of the file, unless the
O_APPEND mode is also selected.

If O_APPEND mode is selected, the file pointer is positioned to the end of the
file at the conclusion of a successful open operation, so that data written to the file is
added following the existing file contents.

Ordinarily, a file must already exist in order to be opened. If the O_CREAT
mode is selected, files that do not currently exist are created; otherwise, the open
function will return an error condition in gr121.

If a file being opened already exists and the O_TRUNC mode is selected, the
original contents of the file are discarded and the file pointer is placed at the
beginning of the (empty) file. If the file does not already exist, the HIF service routine
should return an error value in gr121, unless O_CREAT mode is also selected.

The O_EXCL mode provides a method for refusing to open the file if the
O_WRONLY or O_RDWR modes are selected and the file already exists. In this
case, the kernel service routine should return an error code in gr121.

O_FORM mode indicates that the file is to be opened as a text file, rather than a
binary file. The nominal standard input, output, and error files (file descriptors 0, 1,
and 2) are assumed to be open in text mode prior to commencing execution of the
user’s program.

When opening a FIFO (interprocess communication file) with O_RDONLY or
O_WRONLY set, the following conditions apply:

•  If O_NDELAY is set (i.e., equal to 0x0010):

   - A read-only open will return without delay.

   - A write-only open will return an error if no process currently has the file
     open for reading.

•  If O_NDELAY is clear (i.e., equal to 0x0000):

   - A read-only open will block until a process opens the file for writing.

   - A write-only open will block until a process opens the file for reading.

When opening a file associated with a communication line (e.g., a remote
modem or terminal connection), the following conditions apply:

•  If O_NDELAY is set, the open will return without waiting for the carrier detect
   condition to be TRUE.

•  If O_NDELAY is clear, the open will block until the carrier is found to be present.

The optional pflag parameter specifies the access permissions associated with a
file; it is only required when O_CREAT is also specified (i.e., create a new file if it
does not already exist). If the file already exists, pflag is ignored. This parameter
specifies UNIX-style file access permission codes (r, w, and x for read, write, and
execute respectively) for the file’s owner, the work group, and other users. If pflag is
-1, then all accesses are allowed. See the UNIX operating system documentation for
additional information on this topic.

Example Call

path:   .ascii  "/usr/jack/files/myfile\0"
        .set    mode,O_RDWR | O_CREAT | O_FORM
        .set    permit,0x180
fd:     .word   0
                                ;
        const   lr2,path        ;address of pathname
        consth  lr2,path        ;
        const   lr3,mode        ;open mode settings
        const   lr4,permit      ;permissions
        const   gr121,17        ;service = 17 (open)
        asneq   69,gr1,gr1      ;perform OS call
        jmpf    gr121,open_err  ;jump if error on open
        const   gr120,fd        ;set address of
        consth  gr120,fd        ;file descriptor
        store   0,0,gr96,gr120  ;store file descriptor

In the above example, the file is being opened in read/write text mode. The
UNIX permissions of the owner are set to allow reading and writing, but not
execution, and all other permissions are denied. As indicated above in the parameter
descriptions, the file permissions are only used if the file does not already exist. When
the open service returns, the program jumps to the open_err error handler if the open
was not successful; otherwise, the file descriptor returned by the service is stored for
future use in read, write, lseek, remove, rename, or close service calls.
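The permit value 0x180 used in the example can be decoded to confirm the permissions described above. The Python sketch below assumes the conventional UNIX layout of nine permission bits (owner, group, other, each as read/write/execute); the HIF text only says the codes are UNIX-style, so this layout and the helper name are assumptions.

```python
# Decode a pflag value into an "rwxrwxrwx"-style string.
# ASSUMPTION: conventional UNIX layout, nine low-order bits, three bits each
# for owner, group, and other users.

def permission_string(pflag):
    """Render the low nine bits of pflag as owner/group/other rwx triples."""
    bits = pflag & 0o777
    parts = []
    for shift in (6, 3, 0):                 # owner, group, other
        triple = (bits >> shift) & 0o7
        parts.append(
            ("r" if triple & 4 else "-")
            + ("w" if triple & 2 else "-")
            + ("x" if triple & 1 else "-")
        )
    return "".join(parts)

print(oct(0x180), permission_string(0x180))   # the permit value in the example
print(permission_string(-1))                  # pflag of -1: all accesses allowed
```

Under this assumption 0x180 is octal 600: the owner may read and write, and the group and other users have no access.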

As described in the introduction to these services, the HIF can be implemented
to several degrees of elaboration, depending on the underlying system hardware, and
whether the operating system is able to provide the full set of kernel services. In the
least capable instance (i.e., a standalone board with a serial port), it is likely that only
the O_RDONLY, O_WRONLY and O_RDWR modes will be supported. In more
capable systems, the additional modes should be implemented, if possible.

If an error is encountered during the execution of an open call, no file descriptor
will be allocated.

Service 18 - close


Close  a  File 


Description 

This service closes the open file associated with the file descriptor passed in lr2.
Closing all files is automatic on program exit (see exit), but since there is an
implementation-defined limit on the number of open files per process, an explicit close
service call is necessary for programs that deal with many files.


Register Usage

Type        Regs     Contents      Description
Calling:    gr121    18 (0x12)     Service number
            lr2      fileno        File descriptor
Returns:    gr96     retval        Success: = 0
                                   Failure: < 0
            gr121    0x80000000    Logical TRUE, service successful
                     errcode       Error number, service not successful
                                   (implementation dependent)

Example Call

fd:     .word   0

        const   gr96,fd         ;set address of
        consth  gr96,fd         ;file descriptor
        load    0,0,lr2,gr96    ;get file descriptor
        const   gr121,18        ;service = 18
        asneq   69,gr1,gr1      ;and call the OS
        jmpf    gr121,clos_err  ;handle close error
        nop                     ;

The above example illustrates loading a previously stored file descriptor (fd, in
this case) and calling the kernel’s close service to close the file associated with that
descriptor. If an error occurs when attempting to close the file, the kernel will return
an error code in gr121 (the content of that register will not be TRUE) and the program
will jump to an error handler; otherwise, program execution will continue. The file
will be closed and the file descriptor deallocated, even when an error is encountered.


Service 19 - read


Read a Buffer of Data from a File

Description 

This service reads a number of bytes from a previously opened file (identified
by a small integer file descriptor in lr2 that was returned by the open service) into
memory starting at the address given by the buffer pointer in lr3. Register lr4
contains the number of bytes to be read. The number of bytes actually read is returned in
gr96. Zero is returned in gr96 if the file is already positioned at its end-of-file. If an
error is detected, a small positive integer is returned in gr121, indicating the nature of
the error.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   19 (0x13)    Service number
          lr2     fileno       File descriptor
          lr3     buffptr      A pointer to buffer area
          lr4     nbytes       Number of bytes to be read
Returns:  gr96    count*       *See Return Value table, below.
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

The value returned in register gr96 can be interpreted differently, depending on
the current operating mode of the file identified by the fileno parameter. The
operating mode is established or changed by invoking the ioctl service (24). The Return
Value table shows how the Return Value in gr96 should be interpreted for various
operating modes.

Return Value

Count       Non-ASYNC   ASYNC     NBLOCK
gr96 > 0    count       n/a       count
gr96 = 0    EOF         success   EOF
gr96 < 0    fail        fail      if = -1 & gr121 = EAGAIN,
                                  no data is available.
                                  Otherwise, fail.

In the Return Value table, for normal synchronous read service requests, the
return value contains a count of the number of bytes read (if gr96 > 0), end-of-file (if
gr96 = 0), or an indication that the operation failed (gr96 < 0). For ASYNC mode,
the operation is only scheduled by invoking the read service, so the return value in
gr96 merely indicates that the request succeeded or failed. Non-blocking read
requests indicate that data is to be returned if available; otherwise, the service is to
return control to the user process with an indication that the operation would block if
allowed to continue. When gr96 contains the value -1, and the errcode value in
register gr121 is EAGAIN, then no data is available to be read. If gr96 contains any other
negative value, or if register gr121 contains any other error code, the service request
was not accepted.

If  the  operating  mode  of  the  file  descriptor  referenced  by  the  read  service  has 
previously  been  set  to  ASYNC  using  the  ioctl  service,  the  iowait  service  should  be 
used  to  test  the  completion  status  of  this  operation,  and  to  access  the  number  of  bytes 
that  have  been  transferred.  If  a  previously  issued  asynchronous  read,  write,  or  lseek 
operation  is  not  complete,  the  current  read  request  will  return  a  failure  status.  Only 
one  outstanding  request  is  allowed. 

If  the  operating  mode  has  previously  been  set  to  NBLOCK  (non-blocking),  the 
count  value  returned  in  gr96  will  only  reflect  the  number  of  bytes  currently  available 
in  the  buffer.  NBLOCK  mode  only  applies  to  terminal-like  devices. 
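
The interpretation rules for the read service can be summarized in a short C sketch. It is only an illustration: the names classify_read, read_mode, read_status, and HIF_TRUE are hypothetical, and the numeric value given to EAGAIN_CODE is an assumption, since the HIF leaves error numbers implementation dependent. A positive gr96 in ASYNC mode is listed as n/a in the table and is treated here as a failure.

```c
#include <stdint.h>

#define HIF_TRUE    0x80000000u  /* gr121 value when the service succeeds */
#define EAGAIN_CODE 11           /* assumed value; implementation dependent */

enum read_mode   { MODE_SYNC, MODE_ASYNC, MODE_NBLOCK };
enum read_status { RD_COUNT, RD_EOF, RD_SCHEDULED, RD_NO_DATA, RD_FAIL };

/* Classify the values a read service call left in gr96 and gr121. */
enum read_status classify_read(enum read_mode mode, int32_t gr96, uint32_t gr121)
{
    /* NBLOCK only: -1 together with EAGAIN means "no data yet", not failure. */
    if (mode == MODE_NBLOCK && gr96 == -1 && gr121 == EAGAIN_CODE)
        return RD_NO_DATA;

    /* Any other negative count, or a gr121 that is not TRUE, is a failure. */
    if (gr96 < 0 || gr121 != HIF_TRUE)
        return RD_FAIL;

    /* ASYNC: the request was only scheduled; iowait supplies the byte count. */
    if (mode == MODE_ASYNC)
        return gr96 == 0 ? RD_SCHEDULED : RD_FAIL;

    return gr96 == 0 ? RD_EOF : RD_COUNT;
}
```

A caller would typically retry later on RD_NO_DATA, and call iowait after RD_SCHEDULED to learn how many bytes were transferred.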

Example Call

fd:     .word   0
buf:    .block  256

        const   gr119,fd
        consth  gr119,fd
        load    0,0,lr2,gr119    ; get file descriptor
        const   lr3,buf          ; set buffer address
        consth  lr3,buf          ;
        const   lr4,256          ; specify buffer size
        const   gr121,19         ; service = 19
        asneq   69,gr1,gr1       ; call the OS
        jmpf    gr121,rd_err     ; handle read errors
        nop

The example call requests the HIF to return 256 bytes from the file descriptor
contained in the variable fd. If the call is successful, gr121 will contain a TRUE
value and gr96 will contain the number of bytes actually read. If the service fails, gr121
will contain the error code.


Service 20 - write


Write a Buffer of Data to a File

Description 

This service writes a number of bytes from memory (starting at the address
given by the pointer in lr3) into the file specified by the small positive integer file
descriptor that was returned by the open service when the file was opened for writing.
Register lr4 contains the number of bytes to be written. The number of bytes actually
written is returned in gr96. If an error is detected, gr121 will contain a small positive
integer on return from the service, indicating the nature of the error.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   20 (0x14)    Service number
          lr2     fileno       File descriptor
          lr3     buffptr      A pointer to the buffer area
          lr4     nbytes       Number of bytes to be written
Returns:  gr96    count*       *See Return Value table, below.
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

The value returned in register gr96 can be interpreted differently, depending on
the current operating mode of the file identified by the fileno parameter. The
operating mode is established or changed by invoking the ioctl service (24). The following
table shows how the Return Value in gr96 should be interpreted for various operating
modes.


Return Value

Count            Non-ASYNC   ASYNC          NBLOCK
gr96 = lr4       success     n/a            (NBLOCK mode is not illegal for
0 < gr96 < lr4   fail        = 0, success   write requests, but requests are
gr96 < 0         extreme     fail           performed in either synchronous or
                                            ASYNC mode. Return values are
                                            interpreted accordingly.)

In the Return Value table, for normal synchronous write service requests, the
return value contains a count of the number of bytes written. If the value returned in
gr96 is equal to the nbytes argument passed to the service in lr4, the write operation
was successful. Any other return value indicates that an error occurred. If gr96
contains a value between 0 and the value of nbytes, the failure is not catastrophic.
Negative values returned in gr96 indicate extreme errors.

For ASYNC mode, the operation is only scheduled by invoking the write
service, so the return value in gr96 merely indicates that the request succeeded or failed.
A return value of 0 in gr96 indicates that the asynchronous write operation was
successfully scheduled.

Non-blocking write requests are performed in either synchronous or
asynchronous mode, depending on whether the ASYNC operating mode was selected.
NBLOCK mode is ignored; the return value in gr96 is interpreted according to the
values shown for Non-ASYNC and ASYNC modes in the table.
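
The write rules can be condensed into a small C sketch in the same spirit. The function and enumerator names are hypothetical. Two cases the table does not define are resolved here by assumption: a nonzero, non-negative gr96 in ASYNC mode is reported as rejected, and any non-negative synchronous count other than nbytes is reported as a short (non-catastrophic) write.

```c
#include <stdint.h>

enum write_status { WR_OK, WR_SHORT, WR_EXTREME, WR_SCHEDULED, WR_REJECTED };

/* Interpret gr96 after a write request for nbytes bytes.
 * async is nonzero when the descriptor is in ASYNC mode.
 * NBLOCK is ignored for writes, so it is not a parameter. */
enum write_status classify_write(int async, int32_t gr96, int32_t nbytes)
{
    if (async)
        return gr96 == 0 ? WR_SCHEDULED : WR_REJECTED;

    if (gr96 == nbytes)
        return WR_OK;        /* everything requested was written */
    if (gr96 < 0)
        return WR_EXTREME;   /* negative values indicate extreme errors */
    return WR_SHORT;         /* an error, but not a catastrophic one */
}
```

After WR_SCHEDULED, the number of bytes actually written is obtained from the iowait service rather than from this return value.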

Example Call

fd:     .word   0
buf:    .block  256

        const   gr96,fd          ; set address of
        consth  gr96,fd          ; file descriptor
        load    0,0,lr2,gr96     ; get file descriptor
        const   lr3,buf          ; set buffer address
        consth  lr3,buf          ;
        const   lr4,256          ; specify buffer size
        const   gr121,20         ; service = 20
        asneq   69,gr1,gr1       ; call the OS
        jmpf    gr121,wr_err     ; handle write errors
        const   gr120,num        ; set address of
        consth  gr120,num        ; "num" variable
        store   0,0,gr96,gr120   ; store bytes written

The example call writes 256 bytes from the buffer located at buf to the file
associated with the descriptor stored in fd. If errors are detected during execution of
the service, the value returned in gr121 will be FALSE. In this case, the wr_err error
handler will be invoked. The number of bytes actually written is stored in the variable
num.


Service 21 - lseek


Seek  a  File  Byte 


Description 

This service positions the file associated with the file descriptor in lr2 at offset
bytes from the position in the file referred to by the orig parameter. Register lr3
contains the number of bytes offset and lr4 contains the value for orig. The
parameter orig is defined as:

    0 = Beginning of the file
    1 = Current position of the file
    2 = End of the file

The lseek service can be used to reposition the file pointer anywhere in a file.
The offset parameter may be either positive or negative. However, it is considered an
error to attempt to seek in front of the beginning of the file. The result of any attempt
to seek past the end of the file is undefined, and is dependent on the restrictions of
each implementation.
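
How orig and offset combine can be sketched in C. This is an illustration only: seek_target and the ORIG_* names are hypothetical, and because seeking past the end of the file is implementation dependent, this sketch simply assumes it is permitted.

```c
#include <stdint.h>

/* orig codes from the lseek description */
enum { ORIG_BEGIN = 0, ORIG_CURRENT = 1, ORIG_END = 2 };

/* Return the absolute byte position an lseek request asks for, or -1 if the
 * request is an error (unknown orig, or a position before the start of the
 * file). Positions past the end are allowed here by assumption. */
int64_t seek_target(int64_t cur_pos, int64_t file_size, int64_t offset, int orig)
{
    int64_t base;

    switch (orig) {
    case ORIG_BEGIN:   base = 0;         break;
    case ORIG_CURRENT: base = cur_pos;   break;
    case ORIG_END:     base = file_size; break;
    default:           return -1;        /* unrecognized orig code */
    }

    int64_t target = base + offset;
    return target < 0 ? -1 : target;
}
```

For instance, an offset of 0 with an orig of 2 yields the file size, and a negative offset from the current position moves the file pointer backward.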


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   21 (0x15)    Service number
          lr2     fileno       File descriptor
          lr3     offset       Number of bytes offset from orig
          lr4     orig         A code number indicating the point
                               within the file from which the offset
                               is measured
Returns:  gr96    where*       *See Return Value table, below.
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

The value returned in register gr96 can be interpreted differently, depending on
the current operating mode of the file identified by the fileno parameter. The
operating mode is established or changed by invoking the ioctl service (24). The Return
Value table shows how the Return Value in gr96 should be interpreted for various
operating modes.

Return Value

Count        Non-ASYNC   ASYNC   NBLOCK
gr96 >= 0    where       n/a     (NBLOCK mode is not illegal for
gr96 < 0     fail        fail    lseek requests, but requests are
                                 performed in either synchronous or
                                 ASYNC mode. Return values are
                                 interpreted accordingly.)

In the Return Value table, for normal synchronous lseek service requests, the
return value contains the current position in the file, if the value is greater than or
equal to 0. Negative values returned in gr96 indicate that the request was not
accepted.

The file position returned by the lseek service in gr96 (where) is always
measured from the beginning of the file. A value of 0 refers to the beginning, and any
other positive non-zero value refers to the current position in the file. To determine
the size in bytes for a particular file, an lseek request with an offset value of 0 and an
orig value of 2 will position the file to its end and return the byte position of the
end-of-file, which is an accurate measure of the size of the file.

Asynchronous lseek requests are allowed if the operating mode for the file
descriptor associated with the request has been set to ASYNC. In this case, the file
position returned in gr96 (where) will not be relevant. The iowait service call should be
used to determine the final file position when the seek operation is complete.

If a previously issued read or write request is still in progress when an lseek is
issued, a failure status will be returned for the lseek request. Only one request can be
pending at a time. To properly handle this situation, the iowait service should be used
to ensure the completion of an outstanding read or write before issuing the lseek
service request.
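
The one-outstanding-request rule can be modeled with a very small C state machine. This is a sketch under stated assumptions, not part of the HIF: the async_fd structure and both function names are hypothetical, and async_wait stands in for a blocking iowait that always finds the operation complete.

```c
#include <stdbool.h>

/* Minimal model of one file descriptor in ASYNC mode (illustrative only). */
struct async_fd {
    bool pending;   /* an asynchronous read, write, or lseek is in flight */
};

/* Try to issue a new asynchronous request. Returns 0 if it is accepted,
 * or -1 if it is rejected because an earlier request has not completed. */
int async_issue(struct async_fd *f)
{
    if (f->pending)
        return -1;
    f->pending = true;
    return 0;
}

/* Model of a blocking iowait: the outstanding request completes.
 * Returns -1 if nothing was pending, which iowait reports as an error. */
int async_wait(struct async_fd *f)
{
    if (!f->pending)
        return -1;
    f->pending = false;
    return 0;
}
```

In this model, an lseek issued while a read is pending is rejected, and succeeds once the wait has cleared the pending request.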

Example Call

fd:     .word   6                ; file descriptor = 6

        const   gr96,fd          ; set address of
        consth  gr96,fd          ; file descriptor
        load    0,0,lr2,gr96     ; get file descriptor
        const   lr3,23           ; offset argument = 23
        const   lr4,0            ; origin argument = 0
        const   gr121,21         ; service = 21
        asneq   69,gr1,gr1       ; call the OS
        jmpf    gr121,seek_err   ; seek error if false
        nop

The call example shows how a file can be positioned to a particular byte address
by specifying the orig, which is the starting point from which the file position is
adjusted, and the offset, which is the number of bytes from the orig, to move the file
pointer. In this case, the file identified by file descriptor 6 is being repositioned to byte
23, measured from the beginning of the file (origin = 0).

The file descriptor, offset, and orig values are loaded and lseek is called to
perform the file positioning operation. If an error occurs when attempting to reposition
the file, the value returned in gr121 is FALSE, and contains an error code that
indicates the reason for the error. Upon return, gr96 also contains the file position
measured from the beginning of the file.


Service 22 - remove


Remove  a  File 


Description 

This service deletes a file from the file system. Register lr2 contains a pointer to
the pathname of the file. The path must point to an existing file, and the referenced
file should not be currently open. The behavior of the remove service is undefined if
the file is open. Any attempt to remove a currently open file will have an
implementation-dependent result.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   22 (0x16)    Service number
          lr2     pathname     A pointer to string that contains the
                               pathname of the file
Returns:  gr96    retval       Success: = 0
                               Failure: < 0
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

path:   .ascii  "/usr/jack/files/myfile\0"

        const   lr2,path         ; set address of file
        consth  lr2,path         ; pathname.
        const   gr121,22         ; service = 22
        asneq   69,gr1,gr1       ; call the OS
        jmpf    gr121,rem_err    ; jump if error
        nop

In the example call, a file with a UNIX-style pathname stored in the string
named path is being removed. The address (pointer) to the string is put into lr2 and
the kernel service 22 is called to remove the file. If the file does not exist, or if it has
not previously been closed, an error code will be returned in gr121; otherwise, the
value in gr121 will be TRUE.


Service 23 - rename


Rename a File

Description 

This service moves a file to a new location within the file system. Register lr2
contains a pointer to the file’s old pathname and lr3 contains a pointer to the file’s new
pathname. When all components of the old and new pathnames are the same, except
for the filename, the file is said to have been renamed. The file identified by the old
pathname must already exist, or an error code will be returned and the rename
operation will not be performed.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   23 (0x17)    Service number
          lr2     oldfile      A pointer to string containing the old
                               pathname of the file
          lr3     newfile      A pointer to string containing the
                               new pathname of the file
Returns:  gr96    retval       Success: = 0
                               Failure: < 0
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

old:    .ascii  "/usr/fred/payroll/report\0"
new:    .ascii  "/usr/fred/history/june89\0"

        const   lr2,old          ; set address of old pathname
        consth  lr2,old
        const   lr3,new          ; set address of new pathname
        consth  lr3,new
        const   gr121,23         ; service = 23 (rename)
        asneq   69,gr1,gr1       ; call the OS
        jmpf    gr121,ren_err    ; jump if rename error
        nop

The example call moves a file from its old path (renaming it in the process) to its
new pathname location. The file will no longer be found at the old location.


Service 24 - ioctl


Input/Output  Control 


Description 

This service establishes the operating mode of the specified file or device. It is
intended primarily to be applied to terminal-like devices; however, certain modes
apply to mass-storage files, or to other related input/output devices.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   24 (0x18)    Service number
          lr2     fileno       File descriptor number to be tested
          lr3     mode         Operating mode
Returns:  gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               EHIFNOTAVAIL if service not
                               implemented
                               (implementation dependent)


Parameter  Descriptions 

In the above interface, local register lr2 is expected to contain a legal file
descriptor, fileno, assigned by the HIF open service (HIF service number 17). The mode
parameter establishes the desired operating mode, which is selected from one or
more of the following:

    0x0000   COOKED   Process I/O data characters
    0x0001   RAW      Do not process I/O data characters
    0x0002   CBREAK   Process only I/O signals
    0x0004   ECHO     Echo read data
    0x0008   ASYNC    Asynchronous data read
    0x0010   NBLOCK   Non-blocking data read

Multiple mode values are possible; however, COOKED, RAW, and CBREAK
modes are mutually exclusive. Other mode values can be combined with these by
logically ORing them to form a composite mode value. Certain of these mode values
do not apply to every open file descriptor. For example, the ASYNC mode is used to
establish a data input mode that will cause a read, write, or lseek operation, once
initiated, to complete at a later time. With the ASYNC mode set, a read or write
request will immediately return after passing the buffer address and file descriptor to
the operating system, leaving the scheduling of the operation up to the HIF
implementation. lseek operations can also be serviced in ASYNC mode. The completion
status of these operations can be tested by issuing an iowait service request (HIF
service number 25). When a read or write operation is issued for a file descriptor whose
operating mode is ASYNC, the count returned in gr96 will be 0 if the operation was
accepted, or less than 0 if the operation was rejected. An iowait service should be
issued to ascertain the number of bytes that have been transferred upon completion of
the operation.

The default I/O processing mode is COOKED (0x0000), which implies that the
HIF implementation examines input and output data characters as they are received,
or before they are sent, and may perform some alteration of the data. Specific
alterations are not explicitly indicated in this specification; however, it is common to
perform end-of-line processing for files whose operating mode is COOKED. ASCII
carriage-return and line-feed translations are common, as may be the translation of
ASCII TAB characters to a number of equivalent spaces. When RAW mode is
selected, no translation of input or output characters will be performed by
HIF-conforming implementations.

Normally, when a read operation is issued for a terminal-like device by the
application program, the processor will block any further execution of the subject
program until the data has been transferred. The NBLOCK mode is intended to specify
for terminal-like devices that subsequent read operations be executed without
suspending (blocking) further CPU operation. This is particularly relevant to read
operations when RAW mode is also selected. If NBLOCK mode has been specified, a
subsequent read operation will return (in gr96) the number of characters currently
available, or -1 if none are available. NBLOCK mode is not meaningful for write
operations, but they are handled in the same fashion as synchronous or asynchronous
operations, depending on whether ASYNC mode was specified.

RAW  mode  delivers  the  characters  to/from  the  I/O  device  without  conversion  or 
interpretation  of  any  kind. 

If COOKED mode has been selected, line-buffering is implied. If NBLOCK is
also selected, a subsequent read operation will return -1 for the count, unless an
entire line of input is available.

The ECHO mode applies only to the standard input device (file descriptor = 0),
and makes provision to automatically echo data received from that device to the
standard output device (file descriptor = 1). ECHO mode is undefined for any other file
descriptor.

The CBREAK mode is intended for file descriptors that refer to serial
communication channels. CBREAK mode specifies that I/O signal inputs will be processed,
which could alter the data stream.

The NBLOCK and ASYNC settings are not necessarily mutually exclusive.
There may be occasions where this combination is a legal mode. NBLOCK specifies that
subsequent read, write, or lseek operations not block until completion. If a read is
requested, for example, and no data is currently available, the read service will return
-1 (with an errcode value in gr121 of EAGAIN), rather than blocking further
execution until data becomes available. ASYNC mode simply allows an operation, once
invoked, to proceed asynchronously with other operations, if the HIF
implementation provides this capability.

If  the  above  mode  settings  are  not  implemented,  the  EHIFNOTAVAIL  error 
code  should  be  returned  to  the  user  if  the  ioctl  service  is  invoked. 

Although the mode parameter occupies a 32-bit word, only the low-order
16 bits are reserved. The upper 16 bits are available for implementation-dependent
mode settings, and are not part of this specification.
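
Building a composite mode value can be sketched in C as follows. The HIF_* macro names and the make_mode helper are hypothetical; only the numeric values come from the mode list. The sketch accepts the specification-defined bits only, so it rejects the implementation-dependent upper 16 bits, which a real implementation may choose to allow.

```c
#include <stdint.h>

/* Mode values from the ioctl parameter description */
#define HIF_COOKED 0x0000u
#define HIF_RAW    0x0001u
#define HIF_CBREAK 0x0002u
#define HIF_ECHO   0x0004u
#define HIF_ASYNC  0x0008u
#define HIF_NBLOCK 0x0010u

/* Combine one processing mode (COOKED, RAW, or CBREAK) with optional flags.
 * Returns 1 and stores the composite value in *mode, or returns 0 if the
 * processing mode is not exactly one of the three mutually exclusive choices
 * or the flags contain bits other than ECHO, ASYNC, and NBLOCK. */
int make_mode(uint32_t processing, uint32_t flags, uint32_t *mode)
{
    if (processing != HIF_COOKED && processing != HIF_RAW &&
        processing != HIF_CBREAK)
        return 0;
    if (flags & ~(HIF_ECHO | HIF_ASYNC | HIF_NBLOCK))
        return 0;

    *mode = processing | flags;   /* logical OR forms the composite mode */
    return 1;
}
```

For example, RAW combined with NBLOCK gives 0x0011, the value that would be placed in lr3 before invoking service 24.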

Example Call

fd:     .word   0                ; variable to contain the file
                                 ; descriptor

        const   gr120,fd         ; get fd address
        consth  gr120,fd
        load    0,0,lr2,gr120    ; load file descriptor
        const   lr3,0x0010       ; NBLOCK mode
        const   gr121,24         ; service = 24
        asneq   69,gr1,gr1       ; call the OS
        jmpf    gr121,io_err     ; jump if failure
        nop

In the example call, a previously assigned file descriptor is passed to the service,
in order to specify that subsequent read requests not block if data is not available. If an
error occurs when servicing this request, gr121 will be set to FALSE and the program
will jump to an error handling routine (io_err) when the service returns.


Service 25 - iowait


Test  &  Wait  I/O  Complete 


Description 

This service is used in conjunction with the ioctl (ASYNC mode) and read,
write, or lseek services to test the completion of an asynchronous input/output
operation and, optionally, to wait until the operation is complete. The iowait service is
called with the file descriptor returned by the open service when the file was
originally opened. The mode parameter specifies whether the iowait will block until the
operation is complete, or immediately return the completion status in the result register
(gr96). If the operation was complete, gr96 will contain the number of bytes
transferred for read or write service requests (count), or the ending file position
(measured from the beginning of the file) for lseek service requests (where).

If no previous asynchronous (ioctl ASYNC mode) read, write, or lseek service
is pending for the specified file descriptor, or if an unrecognized mode value is
provided, the iowait service will return an error status in gr121.

Register Usage

Type      Regs    Contents     Description
Calling:  gr121   25 (0x19)    Service number
          lr2     fileno       File descriptor, as returned by open (17)
          lr3     mode         1 = non-blocking completion test
                               2 = wait until read operation complete
Returns:  gr96    count*       *See Return Value table
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

The value returned in register gr96 can be interpreted differently, depending on
the value specified in the mode parameter (in register lr3) of the service request. The
Return Value table shows how the return value in gr96 should be interpreted for
non-blocking and blocking completion tests.

Return Value

            Blocking Tests        Non-blocking Tests
Count       read/write   lseek    read/write   lseek
gr96 > 0    count        where    count        where
gr96 = 0    EOF          where    EOF          where
gr96 < 0    fail         fail     If = -1 & gr121 = EAGAIN,
                                  there is no data available;
                                  otherwise, fail.

In the Return Value table, for blocking completion tests, the return value
specifies the status of the completed operation. If the operation was a read or write service
request, the count value specifies the number of bytes actually transferred (gr96 > 0),
an end-of-file condition was reached (gr96 = 0), or that a failure occurred (gr96 < 0).
For lseek requests, the return value specifies the current position of the file, unless the
value is negative, in which case a failure occurred.

The return value for non-blocking completion tests of read and write service
requests is interpreted the same as for blocking completion tests, except for the case
where the value in gr96 is equal to -1. In this case, and if the errcode in register gr121
is EAGAIN, then no data is currently available. Any other negative return value or
error code signals a failure condition.

The iowait service reports errors that may have occurred in the outstanding
asynchronous operation, subsequent to its original issue, as well as errors in the
iowait call itself.
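
A polling loop built on the non-blocking completion test might look like the following C sketch. Everything named here is hypothetical: the hif_result structure, the iowait_fn callback standing in for the service call, the fake_iowait demonstration stub, the return conventions of poll_iowait, and the numeric value assumed for EAGAIN.

```c
#include <stdint.h>

#define HIF_TRUE    0x80000000u  /* gr121 value when the service succeeds */
#define EAGAIN_CODE 11           /* assumed value; implementation dependent */

struct hif_result { int32_t gr96; uint32_t gr121; };

/* Hypothetical stand-in for a non-blocking iowait (mode 1) service call. */
typedef struct hif_result (*iowait_fn)(int fileno, void *ctx);

/* Poll until the operation completes, fails, or max_polls is exhausted.
 * Returns the final gr96 value (count or position) on completion,
 * -1 on failure, and -2 if the operation is still pending. */
int32_t poll_iowait(iowait_fn iowait, int fileno, void *ctx, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        struct hif_result r = iowait(fileno, ctx);
        if (r.gr96 == -1 && r.gr121 == EAGAIN_CODE)
            continue;            /* not finished; other work could go here */
        if (r.gr96 < 0 || r.gr121 != HIF_TRUE)
            return -1;           /* error in the operation or in iowait */
        return r.gr96;
    }
    return -2;
}

/* Demo stub: reports "not yet" for *ctx polls, then 256 bytes transferred. */
struct hif_result fake_iowait(int fileno, void *ctx)
{
    int *remaining = ctx;
    (void)fileno;
    if (*remaining > 0) {
        (*remaining)--;
        return (struct hif_result){ -1, EAGAIN_CODE };
    }
    return (struct hif_result){ 256, HIF_TRUE };
}
```

Bounding the number of polls is a design choice of this sketch; a program that has nothing else to do could instead issue a single blocking iowait with a mode value of 2.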

Example Call

fd:     .word   0                ; file descriptor

        const   lr3,1            ; non-blocking completion
        const   gr121,25         ; service = 25 (iowait)
        const   gr120,fd         ; load file descriptor address
        consth  gr120,fd         ;
        load    0,0,lr2,gr120    ; get file descriptor
loop:   asneq   69,gr1,gr1       ; call the OS
        jmpf    gr121,wait_err   ; handle wait error
        const   lr3,1            ; non-blocking completion
        jmpt    gr96,loop        ; wait until op. complete
        const   gr121,25         ; service = 25 (iowait)

In the example call, the file descriptor (fileno) is loaded into lr2, non-blocking
mode is selected, and the iowait service is invoked. If the service returns an error
status in gr121, the program will jump to the wait_err label. If the operation is accepted,
gr96 will contain the completion status upon return from the service. This example
jumps to reinvoke the service if the operation is not yet complete. This is equivalent
to issuing an iowait service with a mode value of 2, specifying that the operation
should block until the operation is complete. A more complex program might
perform some useful work before retrying the operation.


Service 26 - iostat


Input/Output  Status 


Description 

This service returns the status corresponding to a file descriptor assigned by the
open service. If the specified file descriptor is not legal, an error code will be returned
in gr121; otherwise, gr121 will contain a TRUE result, and gr96 will contain the
requested status. Two status values are defined:

    0x0001   RDREADY   Input device ready and data available
    0x0002   ISATTY    File descriptor refers to terminal-like device (TTY)

Application  programs  frequently  need  to  determine  if  data  is  currently  available 
to  be  read  for  a  terminal-like  device.  If  the  RDREADY  status  is  returned,  at  least  one 
byte  of  data  is  available  to  be  read  from  the  device. 

The ISATTY status indicates that the device associated with the file descriptor
refers to a terminal-like peripheral, rather than a mass-storage file or other peripheral
device. The iostat service can be used to determine if a standard output device (file
descriptors 1 or 2) refers to a terminal, or if output is being redirected to a
mass-storage file.

The  RDREADY  and  ISATTY  status  values  are  not  mutually  exclusive;  either  or 
both  results  may  be  present.  Although  the  status  is  returned  in  a  32-bit  word,  only  the 
lower  16  bits  are  reserved  for  HIF-conforming  reply  values.  The  upper  16  bits  are 
available  for  implementation-specific  status  results. 
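
Testing the two status bits can be sketched in C. The macro and function names are hypothetical; only the bit values come from the status list. The last helper mirrors the idiom used in this service's assembly example, where bit 1 is moved into the sign bit by a left shift of 30 places.

```c
#include <stdint.h>

#define HIF_RDREADY 0x0001u
#define HIF_ISATTY  0x0002u

/* Nonzero if the iostat value reports a terminal-like device. */
int is_tty(uint32_t iostat)
{
    return (iostat & HIF_ISATTY) != 0;
}

/* Nonzero if at least one byte can be read without waiting. */
int read_ready(uint32_t iostat)
{
    return (iostat & HIF_RDREADY) != 0;
}

/* Same test as is_tty, done by shifting bit 1 into bit 31 (the sign bit). */
int is_tty_by_shift(uint32_t iostat)
{
    return ((iostat << 30) & 0x80000000u) != 0;
}
```

Masking individual bits, rather than comparing the whole word, keeps the tests correct when both status values are present or when an implementation uses the upper 16 bits.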


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   26 (0x1A)    Service number
          lr2     fileno       File descriptor number
Returns:  gr96    iostat       Input status
                               0x0001 = RDREADY
                               0x0002 = ISATTY
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)


Example Call

        const   lr2,0              ; set file descriptor = 0
        const   gr121,26           ; service = 26
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,fail         ; handle failure
        sll     gr120,gr96,30      ; test ISATTY status bit
        jmpf    gr120,not_tty      ; jump if not a tty
        nop

In the example call, the program calls the iostat service to determine if the
device associated with file descriptor 0 is a tty-like device. If the service returns an
error indication in gr121, the program jumps to the fail label; otherwise, the iostat
value returned in gr96 is shifted to put bit 1 of the result into the sign bit of gr120, which
is tested to determine if the file descriptor refers to a tty-like device. If not, the
program jumps to the not_tty label.

Return Temporary Name


Service  33  -  tmpnam 

Description 

This service generates a string that can be used as a temporary file pathname. A
different name is generated each time it is called. The name is guaranteed not to
duplicate any existing filename. The argument passed in lr2 should be a valid pointer to a
buffer that is large enough to contain the constructed file name. User programs are
required to allocate a minimum of 128 bytes for this purpose.

If the argument in lr2 contains a NULL pointer, the HIF service routine should
treat this as an error condition and return a non-zero error number in global register
gr121.

The HIF specification sets no standards for the format or content of legal
pathnames; these are determined by individual operating system requirements. Each
implementation must undertake to construct a valid filename that is also unique.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   33 (0x21)    Service number
          lr2     addrptr      A pointer to buffer into which the
                               filename is to be stored
Returns:  gr96    filename     Success: pointer to the temporary
                               filename string
                               Failure: = 0 (NULL pointer)
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

fbuf:   .block  21                 ; buffer size = 21 bytes

        const   lr2,fbuf           ; set buffer pointer
        consth  lr2,fbuf
        const   gr121,33           ; service = 33
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,tmp_err      ; jump if error
        nop

In the example call, the tmpnam service is called with a pointer to fbuf, which
has been allocated to hold a name that is up to 21 bytes in length. If the service is able
to construct a valid name, the filename will be stored in fbuf when the service returns.
If the content of gr121 on return is not TRUE, the program fragment jumps to
tmp_err to handle the error condition.

Service 49 - time


Return  Seconds  Since  1970 


Description 

This service returns, in register gr96, the number of seconds elapsed since
midnight, January 1, 1970, as a 32-bit integer value. It is assumed that the kernel service
has access to a counter, whose contents can be preloaded, that measures time
with at least one-second resolution.

The  time  value  returned  by  this  service  is  Greenwich  Mean  Time  (GMT).  The 
conversion  to  local  time  should  be  accomplished  by  a  separate  function  that  uses  the 
value  returned  by  the  time  service  and  the  time-zone  information  from  the  gettz  (Get 
time  zone)  service  call  to  compute  the  correct  local  time. 
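
The separate conversion function mentioned above might be sketched as follows. This is an illustration under stated assumptions, not part of the HIF specification: the function name is hypothetical, the zonecode argument is taken to be the gr96 value from gettz (minutes west of GMT, with values above 1,440 denoting minutes east after 1,440 is subtracted, as the gettz description states), and no Daylight Savings adjustment is applied because the specification does not say how the dstcode flag relates to the correction.

```c
#include <stdint.h>

#define MINUTES_PER_DAY 1440

/* Convert GMT seconds (from the time service) to local seconds using the
   zonecode value (from the gettz service).
   Returns 0 on success, -1 if zonecode signals failure/unavailable. */
int local_time_from_gmt(uint32_t gmt_secs, int32_t zonecode,
                        uint32_t *local_secs)
{
    int32_t minutes_west;
    int64_t local;

    if (zonecode < 0)
        return -1;                      /* no time zone information */

    if (zonecode > MINUTES_PER_DAY)
        minutes_west = -(zonecode - MINUTES_PER_DAY);   /* east of GMT */
    else
        minutes_west = zonecode;                        /* west of GMT */

    /* Localities west of Greenwich are behind GMT. */
    local = (int64_t)gmt_secs - (int64_t)minutes_west * 60;
    *local_secs = (uint32_t)local;
    return 0;
}
```

For example, a zonecode of 300 (five hours west) subtracts 18,000 seconds from the GMT value, while a zonecode of 1,500 is read as 60 minutes east and adds 3,600 seconds.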


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   49 (0x31)    Service number
Returns:  gr96    secs         Success: ≠ 0 (time in seconds)
                               Failure: = 0
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

secs:   .word   0

        const   gr121,49           ; service = 49
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,tim_err      ; jump if error
        const   gr120,secs         ; set the address
        consth  gr120,secs         ; for storing 'secs'
        store   0,0,gr96,gr120     ; store the seconds

In the example call, the kernel service time is being called. If the value returned
in gr121 is TRUE, the number of seconds returned in gr96 is stored in the secs
variable; otherwise, the program jumps to tim_err to determine the cause of the error.

Get Environment


Service  65  -  getenv 

Description 

This service searches the system environment for a string associated with a
specified symbol. Register lr2 contains a pointer to the symbol name. If the symbol name
is found, a pointer to the string associated with it is returned in gr96; otherwise, a
NULL pointer is returned.

In UNIX-hosted systems, the setenv command allows a user to associate a
symbol with an arbitrary string. For example, the command setenv TERM vt100 defines
the string vt100 to be associated with the symbol named TERM. Application
programs can use this association to determine the type of terminal connected to the
system and, therefore, use the correct set of codes when outputting information to the
user's screen. To access the string, getenv should be called with lr2 pointing to a
string containing the TERM symbol name. The address returned in gr96 will point to
the corresponding vt100 string if TERM is found. In UNIX-hosted systems, entering
a different setenv command lets the user select a different terminal name without
requiring recompilation of the application program.

Operating system implementations that do not include provisions for
environment variables should always return a NULL value in gr96 when this service is
requested.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   65 (0x41)    Service number
          lr2     name         A pointer to the symbol name
Returns:  gr96    addrptr      Success: pointer to the symbol name
                               string
                               Failure: = 0 (NULL pointer)
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

mysym:  .ascii  "MYSYMBOL\0"
strptr: .word   0

        const   lr2,mysym          ; set address of symbol to
        consth  lr2,mysym          ; be located in environment
        const   gr121,65           ; service = 65
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,env_err      ; jump if error
        const   gr120,strptr       ; set address of
        consth  gr120,strptr       ; string pointer
        store   0,0,gr96,gr120     ; store string pointer

The example program calls the operating system getenv service to access a
string associated with the environment variable MYSYMBOL. If the symbol is found,
a pointer to the string associated with the symbol is returned in gr96. If the call is not
successful (i.e., gr121 holds a FALSE boolean value upon return), the program jumps
to env_err to handle the error condition.

Get Time Zone


Service  67  -  gettz 

Description 

This service obtains time zone information from the operating system. No
arguments are required. The service returns in gr96 an integer number of minutes of time,
specifying  the  correction  to  Greenwich  Mean  Time  (GMT)  for  localities  west  of 
Greenwich,  England.  A  negative  return  value  in  gr96  indicates  a  failure,  or  that  time 
zone  information  is  unavailable.  A  value  is  also  returned  in  gr97.  If  Daylight  Savings 
Time  is  currently  in  effect,  gr97  will  contain  the  value  1  when  the  service  returns;  if  it 
is  not  in  effect,  gr97  will  contain  the  value  0.  If  this  information  is  not  available,  or  if 
the  service  fails,  gr97  will  contain  a  negative  value. 

Register Usage

Type      Regs    Contents     Description
Calling:  gr121   67 (0x43)    Service number
Returns:  gr96    zonecode     Success: > 0 (minutes west of GMT)
                               Failure: < 0 (or information
                               unavailable)
          gr97    dstcode      Success: = 1 (Daylight Savings
                                            Time in effect)
                                        = 0 (Daylight Savings
                                            Time not in effect)
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

If the result returned in gr96 (zonecode) contains a value greater than 1,440 (60
minutes x 24 hours), then 1,440 should be subtracted from the result; the difference
is then interpreted as minutes east of Greenwich.

Example Call

timzone: .word  0
dstflag: .word  0

        const   gr121,67           ; service = 67
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,tz_err       ; jump if error
        const   lr2,timzone        ; the address to store timezone
        consth  lr2,timzone
        store   0,0,gr96,lr2       ; store the timezone correction
        const   lr2,dstflag        ; the address to store daylight
        consth  lr2,dstflag        ; savings
        store   0,0,gr97,lr2       ; store daylight savings flag


In the example call, the gettz service is called to access the current time zone
correction value. Upon return, gr121 is tested to determine if the service was
successful. If not, the program jumps to an error handling routine called tz_err. If the service
was successful, the values returned in gr96 and gr97 are stored in local variables
called timzone and dstflag, respectively.

Allocate Memory Space


Service  257  -  sysalloc 

Description 

This service allocates a specified number of contiguous bytes from the
operating-system-maintained heap and returns a pointer to the base of the allocated block.
Register lr2 contains the number of bytes requested. If the storage is successfully
allocated, gr96 contains a pointer to the block; otherwise, gr121 contains an error code
indicating the reason for failure of the call.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   257 (0x101)  Service number
          lr2     nbytes       Number of bytes requested
Returns:  gr96    addrptr      Success: pointer to allocated bytes
                               Failure: = 0 (NULL pointer)
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

blkptr: .word   0

        const   lr2,1200           ; request 1200 bytes
        const   gr121,257          ; service = 257
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,alloc_err    ; jump if error
        const   gr120,blkptr       ; set address to store
        consth  gr120,blkptr       ; pointer
        store   0,0,gr96,gr120     ; store the pointer

The example call requests a block of 1200 contiguous bytes from the system
heap. If the call is successful, the program stores the pointer returned in gr96 into a
local variable called blkptr. If gr121 contains a boolean FALSE value when the
service returns, the program jumps to alloc_err to handle the error condition.

Free Memory Space


Service  258  -  sysfree 

Description 

This service returns memory to the system starting at the address specified in
lr2. Register lr3 contains the number of bytes to be released. The pointer passed to the
sysfree service in lr2 and the byte count passed in lr3 must match the address
returned by a previous sysalloc service request for the identical number of bytes. No
dynamic memory allocation structure is implied by this service. High-level language
library functions such as malloc() and free() for the C language are required to
manage random dynamic memory block allocation and deallocation, using the sysalloc
and sysfree kernel functions as their basis.
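
One way a library layer could satisfy the matching address and byte count requirement is to record the size of each block in a small header placed in front of the memory handed to the caller. The sketch below illustrates that idea only; it is not how any particular 29K C library is implemented. The functions hif_sysalloc and hif_sysfree are hypothetical stand-ins for the two services (here backed by the host's malloc and free so that the fragment can run anywhere), and my_alloc and my_free are illustrative names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical host-side stand-ins for the sysalloc/sysfree services. */
static size_t last_freed_nbytes;        /* records the count passed to sysfree */

static void *hif_sysalloc(size_t nbytes)
{
    return malloc(nbytes);
}

static int hif_sysfree(void *addr, size_t nbytes)
{
    last_freed_nbytes = nbytes;
    free(addr);
    return 0;
}

/* Header stored in front of each block; the union keeps the user
   pointer suitably aligned for any object type. */
typedef union {
    size_t      nbytes;                 /* total bytes requested from sysalloc */
    max_align_t align;
} block_header;

void *my_alloc(size_t n)
{
    block_header *h;

    if (n > SIZE_MAX - sizeof(block_header))
        return NULL;                    /* size would overflow */
    h = hif_sysalloc(sizeof(block_header) + n);
    if (h == NULL)
        return NULL;
    h->nbytes = sizeof(block_header) + n;
    return h + 1;                       /* user memory follows the header */
}

int my_free(void *p)
{
    block_header *h;

    if (p == NULL)
        return 0;
    h = (block_header *)p - 1;
    /* Pass back exactly the address and size given by sysalloc. */
    return hif_sysfree(h, h->nbytes);
}
```

A real allocator would also subdivide and reuse blocks; this sketch forwards every request to the system service.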

Register Usage

Type      Regs    Contents     Description
Calling:  gr121   258 (0x102)  Service number
          lr2     addrptr      Starting address of area returned
          lr3     nbytes       Number of bytes to release
Returns:  gr96    retval       Success: = 0
                               Failure: < 0
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

blkptr: .word   0

        const   gr120,blkptr       ; set address of previously
        consth  gr120,blkptr       ; allocated block pointer
        load    0,0,lr2,gr120      ; fetch pointer to block
        const   lr3,1200           ; set number of bytes to release
        const   gr121,258          ; service = 258
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,free_err     ; jump if error
        nop

The example calls sysfree to deallocate 1200 bytes of contiguous memory,
beginning at the address stored in the blkptr variable. If the call is successful, the
program continues; otherwise, if the return value in gr121 is FALSE, the program jumps
to free_err to handle the error condition.

Return Memory Page Size


Service  259  -  getpsize 

Description 

This  service  returns,  in  register  gr96,  the  page  size,  in  bytes,  used  by  the  memory 
system  of  the  HIF  implementation. 


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   259 (0x103)  Service number
Returns:  gr96    pagesize     Success: memory page size, one of
                               the following: 1024, 2048, 4096, 8192
                               Failure: < 0
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

pagsiz: .word   0

        const   gr121,259          ; service = 259
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,pag_err      ; jump if error
        const   gr120,pagsiz       ; set address to
        consth  gr120,pagsiz       ; store the page size
        store   0,0,gr96,gr120     ; store it!

The example calls the operating system kernel to return the page size used by the
virtual memory system. If the call was successful, gr121 will contain a boolean
TRUE result and the program will store the value in gr96 into the pagsiz variable;
otherwise, a boolean FALSE is returned in gr121. In this case, the program will jump
to pag_err to handle the error condition.

Return Base Address


Service  260  -  getargs 

Description 

This service returns the base address of the command-line-argument vector,
argv, in register gr96, as constructed by the operating system kernel when an
application program is invoked.

Arguments are stored by the operating system as a series of NULL-terminated
character strings. A pointer containing the address of each string is stored in an array
whose base address (referred to as argv) is returned by the getargs HIF service. The
last entry in the array contains a NULL pointer (an address consisting of all zero bits).
The number of arguments can be computed by counting the number of pointers in the
array, using the fact that the NULL pointer terminates the list.
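
The counting procedure just described can be sketched in C as follows. The function name count_args is hypothetical; the argument is assumed to be the base address returned in gr96 by getargs.

```c
#include <stddef.h>

/* Count the entries in a NULL-terminated argument vector.
   A NULL base address (service failure) is treated as zero arguments. */
size_t count_args(char *const *argv)
{
    size_t n = 0;

    if (argv == NULL)
        return 0;
    while (argv[n] != NULL)
        n++;
    return n;
}
```

The result corresponds to the argc value that a C startup routine would pass to main().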


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   260 (0x104)  Service number
Returns:  gr96    baseaddr     Success: base address of argv
                               Failure: = 0 (NULL pointer)
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

argptr: .word   0

        const   gr121,260          ; service = 260
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,bas_err      ; jump if error
        const   gr120,argptr       ; set address where base
        consth  gr120,argptr       ; pointer is to be stored
        store   0,0,gr96,gr120     ; store the pointer

The example calls operating system service 260 to access the
command-line-argument vector address. If the service executes without error, the program continues
by storing the argument vector address in the variable argptr. If gr121 contains a
boolean FALSE value upon return, the program jumps to bas_err to handle the error
condition.

Service 273 - clock


Return  Time  in  Milliseconds 


Description 

This service returns the elapsed processor time in milliseconds. Operating
system initialization procedures set this value to zero on startup. Successive calls to this
service return times that can be arithmetically subtracted to accurately measure time
intervals.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   273 (0x111)  Service number
Returns:  gr96    msecs        Success: ≠ 0 (time in milliseconds)
                               Failure: = 0
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

time:   .word   0

        const   gr121,273          ; service = 273
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,clk_err      ; jump if error
        const   gr120,time         ; set the address where
        consth  gr120,time         ; time is to be stored
        store   0,0,gr96,gr120     ; store the time in ms.

The example calls the operating system kernel to get the current value of the
system clock in milliseconds. On return, if gr121 contains a boolean FALSE value, the
program jumps to clk_err to handle the error; otherwise, the time in milliseconds is
stored in the variable time.

The return value from the clock service does not include system I/O data
transfer time incurred by HIF services with service numbers less than 256. The return
value is related to the value returned by the cycles service, in that it is derived from the
processor cycles counter, but scaled by the processor frequency and resolved to
milliseconds.

Return Processor Cycles


Service  274  -  cycles 

Description 

This service returns an ascending positive number in registers gr96 and gr97:
the number of processor cycles that have elapsed since the last processor
initialization was applied to the CPU. It provides a mechanism for user programs to
access the contents of the internal Am29000 processor timer counter register. The cycle
count can be multiplied by the processor clock period (equivalently, divided by the
clock frequency) to convert it to a time value. gr97 will contain the most significant
bits of the cycle count, while gr96 will contain the least significant bits. HIF
implementations of this service are required to provide a cycle count with a minimum
of 42 bits of precision.

The implementor of this HIF service must, as far as possible, eliminate system
I/O data transfer time incurred by HIF services with service numbers less than 256.
This will benefit the user when using this service to perform benchmarks across
different hardware platforms. The user of this service should be aware that the return
value may still contain cycles used in support of operating system tasks.
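
As a sketch of the conversion described above, the following C helpers combine the two result registers into one 64-bit count and scale it to whole milliseconds. The function names are hypothetical, and the clock frequency is assumed to be known in Hertz (for example, from the query service's CPU clock frequency request).

```c
#include <stdint.h>

/* Combine the two halves returned by the cycles service:
   gr97 holds the most significant bits, gr96 the least significant. */
uint64_t cycles_join(uint32_t gr97_high, uint32_t gr96_low)
{
    return ((uint64_t)gr97_high << 32) | gr96_low;
}

/* Convert a cycle count to whole milliseconds for a clock of clock_hz.
   Returns 0 if the frequency is unknown (zero). The division is split
   so that the intermediate product cannot overflow 64 bits. */
uint64_t cycles_to_ms(uint64_t cycles, uint32_t clock_hz)
{
    if (clock_hz == 0)
        return 0;
    return (cycles / clock_hz) * 1000u
         + ((cycles % clock_hz) * 1000u) / clock_hz;
}
```

At 25 MHz, for instance, a count of 25,000,000 cycles corresponds to 1,000 ms.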


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   274 (0x112)  Service number
Returns:  gr96    cycles       Success: Bits 0-31 of processor
                               cycles
                               Failure: = 0 (in both gr96 and gr97)
          gr97    cycles       Success: Bits 32 and higher of
                               processor cycles
                               Failure: = 0 (in both gr96 and gr97)
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

cycles: .word   0                  ; MSb of cycles
        .word   0                  ; LSb of cycles

        const   gr121,274          ; service = 274
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,cyc_err      ; jump if error
        const   gr120,cycles       ; set the address where the
        consth  gr120,cycles       ; count is to be stored
        store   0,0,gr97,gr120     ; store the MSb,
        add     gr120,gr120,4      ; increment the address,
        store   0,0,gr96,gr120     ; then store the LSb of cycles.


The example program fragment calls operating system service 274 to
access the number of CPU cycles that have elapsed since processor initialization. The
cycle count (in gr96 and gr97) is stored in the two words addressed by the variable
cycles if the service call is successful. If gr121 contains a boolean FALSE value on
exit, the program jumps to cyc_err to handle the error condition.

Service 289 - setvec


Set  Trap  Address 


Description 

This service sets the address for user-level trap handler services that implement
the local register stack spill and fill traps. In addition, if the current HIF
implementation supports program calls to set other trap vectors, this service provides that
capability. It returns an indication of success or failure in register gr121. The method used
to invoke these traps in user mode is described in the User-Mode Traps section of
this specification.

The only vectors supported by this specification are 64 (spill) and 65 (fill).
These vectors are invoked by operating system software, using the trampoline
principles described in the section User-Mode Traps, and are not supported by the
Am29000 processor hardware.

Extensions to this service, in implementations that support setting traps other
than spill and fill, will return the previously installed trap address in register gr96, if
the service is successful. For User Mode Traps, register gr96 reports only the success
or failure of the service. In HIF implementations where the extended setvec service is
available, programs can use the returned (previous) vector address to implement
vector chaining.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   289 (0x121)  Service number
          lr2     trapno       Trap number
          lr3     funaddr      Address of trap handler
Returns:  gr96    trapaddr     For user mode traps:
                                 Success: = 0
                                 Failure: < 0
                               For extended trap vectors:
                                 Success: previous trap address
                                 Failure: = 0
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)

Example Call

trpadr: .word   0

        const   lr2,64             ; trap number = 64
        const   lr3,l64_hnd        ; set address of
        consth  lr3,l64_hnd        ; trap-64 handler
        const   gr121,289          ; service = 289
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,vec_err      ; jump if error
        const   gr120,trpadr       ; set address where to
        consth  gr120,trpadr       ; store the trap address
        store   0,0,gr96,gr120     ; and store it!

The example calls the setvec service to pass the address to be used for the trap 64
trap handler routine. If the service returns with gr121 containing a boolean TRUE
result, the program continues by storing the trap address returned in gr96; otherwise,
the program jumps to vec_err to handle the error condition.

Set Trap Vector


Service  290  -  settrap 

Description 

This service provides the means to install trap handler addresses directly into the
vector table whose base address is pointed to by the Vector Area Base Address
special-purpose register (VAB). The vector numbers that may legally be modified by
this service are implementation dependent.

Implementations that do not intend to provide the ability to set trap addresses
with this service should return the EHIFNOTAVAIL error code when this service is
invoked. If certain vectors are restricted from being set by this service, the
implementation should check the trapno parameter and return the EHIFNOTAVAIL error
code for references to restricted trap vectors.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   290 (0x122)  Service number
          lr2     trapno       Vector number
          lr3     trapaddr     Address of trap handler
Returns:  gr96    trapaddr     Address of previous trap handler
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number: EHIFNOTAVAIL if
                               service not available
                               (implementation dependent)

Example Call

oldtrap: .word  0                  ; placeholder for old address

        const   lr2,54             ; floating divide trap vector
        const   lr3,new_div        ; set new_div as the
        consth  lr3,new_div        ; trap handler address
        const   gr121,290          ; service number 290
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,trap_err     ; jump if error
        const   gr120,oldtrap      ; set address for saving
        consth  gr120,oldtrap      ; the old trap handler address
        store   0,0,gr96,gr120     ; save the old handler address

In the example call, a new handler for the floating-point division operation is
being installed. If the implementation returns an error, the program jumps to the
trap_err label. If the service was successful and a new trap handler was installed, the
previous handler address (if any) is stored into the oldtrap variable.

There is often a need for programs operating on dedicated hardware to enter
supervisor mode. This can be accomplished by reserving a trap vector for that purpose
and installing a trap handler routine to return control to the user in supervisor mode.
The operation is effected by issuing an assert instruction that invokes the specified
trap. User mode can be restored by clearing (setting to 0) the Supervisor Mode bit (4)
of the Current Processor Status register (CPS).

Service 291 - setim


Set  Interrupt  Mask 


Description 

This service provides the means to set the interrupt mask (IM) field and the
disable interrupts (DI) field of the current processor status register (CPS). The IM field
enables the external interrupt pins INTR0-INTR3, according to the following
encoding:

00      INTR0 enabled
01      INTR1-INTR0 enabled
10      INTR2-INTR0 enabled
11      INTR3-INTR0 enabled

These two bits provide for a priority-oriented enabling capability; however, the
INTR0 interrupt cannot be disabled through the IM field alone. The disable
interrupts (di) parameter must be set to 1 to produce this effect. A di value of 0 will
enable the selected interrupts, and a value of 2 will leave the DI bit of the CPS
unchanged. If this service is not implemented, an error code of EHIFNOTAVAIL
should be returned by the software. The error code for an illegal value in registers lr2
or lr3 is implementation dependent.
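
The effect of the mask and di parameters on the CPS can be modelled in C as shown below. This is a sketch under stated assumptions rather than kernel code: the function name is hypothetical, and the bit positions used (DI at bit 1, IM at bits 3..2) are an assumption about the Am29000 CPS layout that should be checked against the processor manual. Rejecting out-of-range arguments with -1 is likewise only one possible choice, since the specification leaves that error code implementation dependent.

```c
#include <stdint.h>

/* Assumed CPS layout: DI = bit 1, IM = bits 3..2. */
#define CPS_DI       0x2u
#define CPS_IM_SHIFT 2
#define CPS_IM_MASK  (0x3u << CPS_IM_SHIFT)

/* Apply a setim-style request to a CPS image.
   mask: new 2-bit IM value (0..3)
   di:   0 = enable interrupts, 1 = disable, 2 = leave DI unchanged
   Stores the previous IM value in *old_mask.
   Returns 0 on success, -1 for an illegal mask or di value. */
int setim_apply(uint32_t *cps, uint32_t mask, uint32_t di,
                uint32_t *old_mask)
{
    if (mask > 3u || di > 2u)
        return -1;

    *old_mask = (*cps & CPS_IM_MASK) >> CPS_IM_SHIFT;
    *cps = (*cps & ~CPS_IM_MASK) | (mask << CPS_IM_SHIFT);

    if (di == 0u)
        *cps &= ~CPS_DI;        /* enable the selected interrupts */
    else if (di == 1u)
        *cps |= CPS_DI;         /* disable all, including INTR0 */
    /* di == 2: DI bit left as it was */

    return 0;
}
```

With this encoding, a mask value of n enables pins INTR0 through INTRn whenever the DI bit is clear.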


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   291 (0x123)  Service number
          lr2     mask         New mask field value
          lr3     di           0 = Enable interrupts
                               1 = Disable interrupts
                               2 = Leave interrupt enable unchanged
Returns:  gr96    mask         Old mask field value
          gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number: EHIFNOTAVAIL if
                               service not available
                               (implementation dependent)

Example Call

oldmask: .word  0                  ; placeholder for old mask value

        const   lr2,0x2            ; mask = binary 10 (INTR2-INTR0 enabled)
        const   lr3,0x0            ; enable interrupts (di = 0)
        const   gr121,291          ; service number 291
        asneq   69,gr1,gr1         ; call the OS
        jmpf    gr121,mask_err     ; jump if error
        const   gr120,oldmask      ; set address for saving
        consth  gr120,oldmask      ; the old IM field value
        store   0,0,gr96,gr120     ; save the old IM field value


In the example call, the IM field of the current processor status register is to be
set to binary 10, enabling external interrupt pins INTR0, INTR1, and INTR2. If this service
is not available, or if the value in lr2 is illegal, the service will return an error code, in
which case the program jumps to the mask_err label. If the service execution is
successful, the previous contents of the IM field are stored in the oldmask variable.

Return Version Information


Service  305  -  query 

Description 

This service returns version information, or capabilities of the HIF
implementation, as requested. On entry, the requested capability is passed as an argument in lr2.
The service returns the requested information, or indicates that it is unavailable, in
gr96.


Register Usage

Type      Regs    Contents     Description
Calling:  gr121   305 (0x131)  Service number
          lr2     capcode      Capabilities code
                               0 = Request HIF version
                               1 = Request CPU version & family code
                               2 = Request Am29027 arithmetic
                                   accelerator version
                               3 = Request CPU clock frequency
                               4 = Request memory environment

For lr2 = 0 (HIF version)
Returns:  gr96    hifvers      Success: > 0 (encoded version
                               information). The version number
                               is returned as two 4-bit fields in the
                               low-order 8 bits of the return value.
                               The two fields are separated by an
                               implied decimal point (e.g., 0x20
                               means HIF V2.0).
                               Failure: < 0 (or unavailable)

For lr2 = 1 (CPU version and family code)
Returns:  gr96    cpuvers      Success: > 0 (encoded
                               version/family). The high-order 8 bits
                               of the configuration register (CFG),
                               known as the processor release level
                               (PRL), are moved to the low-order 8
                               bits of gr96, as two 4-bit fields.
                               Failure: < 0 (or unavailable)

For lr2 = 2 (Am29027 coprocessor version)
Returns:  gr96    027vers      Success: > 0 (encoded version
                               information). The high-order 8 bits
                               of the accelerator's precision register
                               form the arithmetic accelerator
                               release level (ARL) and are moved
                               to the low-order 8 bits of gr96, as
                               two 4-bit fields.
                               Failure: < 0 (or unavailable)

For lr2 = 3 (CPU clock frequency)
Returns:  gr96    clkfreq      Success: > 0 (frequency in Hertz)
                               Failure: = 0 (or unavailable)

For lr2 = 4 (Memory environment)
Returns:  gr96    memenv       Success: > 0 (memory environment)
                               BYTEW  0x1  byte-write available
                               DWSET  0x2  DW-bit set
                               IREAD  0x4  Instruction memory
                                           readable
                               Failure: < 0 (or unavailable)

For all requests
Returns:  gr121   0x80000000   Logical TRUE, service successful
                  errcode      Error number, service not successful
                               (implementation dependent)


In addition to the requests listed in the Register Usage table, negative capcode
values in register lr2 are available for implementation-dependent encoding of query
requests. All other positive values in register lr2 are reserved for future expansion of
the HIF query service.

Example Call

vers:    .word   0
         const   lr2, 0             ;request HIF version
         const   gr121, 305         ;service = 305
         asneq   69, gr1, gr1       ;call the OS
         jmpf    gr121, qry_err     ;handle query error
         const   lr2, vers          ;address to store version info
         consth  lr2, vers
         store   0, 0, gr96, lr2    ;store the HIF version number

In the example call, a request code of 0 is loaded into lr2 and the service is called.
Upon return, if the value in gr121 is FALSE, indicating failure, the program jumps to
an error routine. If gr121 is TRUE, then the program stores the returned HIF version
information into the variable called vers.
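
The encodings described in the Register Usage table can be unpacked on the C
side once gr96 has been stored to memory. The following is a minimal sketch,
not part of the HIF specification: the names hif_version, hif_decode_version,
hif_mem_has and the HIF_* macros are hypothetical and chosen here only for
illustration. It assumes the returned gr96 value is available as a 32-bit
signed integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Memory-environment flag bits, as listed for capcode 4. */
#define HIF_BYTEW 0x1u  /* byte-write available */
#define HIF_DWSET 0x2u  /* DW-bit set */
#define HIF_IREAD 0x4u  /* instruction memory readable */

/* Hypothetical container for the two 4-bit version fields. */
typedef struct {
    unsigned major;  /* bits 7..4 of the return value */
    unsigned minor;  /* bits 3..0 of the return value */
} hif_version;

/* Decode a version-style reply (capcodes 0, 1 and 2).
 * Returns false when gr96 signals failure (not greater than 0). */
static bool hif_decode_version(int32_t gr96, hif_version *out)
{
    assert(out != NULL);
    if (gr96 <= 0)
        return false;
    out->major = ((uint32_t)gr96 >> 4) & 0xFu;
    out->minor = (uint32_t)gr96 & 0xFu;
    return true;
}

/* Test one memory-environment flag in a capcode 4 reply. */
static bool hif_mem_has(int32_t memenv, uint32_t flag)
{
    return memenv > 0 && ((uint32_t)memenv & flag) != 0;
}
```

With these helpers, a stored reply of 0x20 decodes to major 2 and minor 0,
which matches the HIF V2.0 example given in the table.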
Service 321 - signal


Register  Signal  Handler 


Description 

This service provides the means to register (or un-register) a specified user
signal handler. On entry, local register lr2 contains the address of the user signal
handler routine. This routine is expected to handle the signals shown in Table A-2,
below. Sections 2.5 and 4.4, and Appendix B, contain additional information on
writing signal handlers for complex environments.

The HIF service returns the address of the previously installed handler in gr96.
If no previous handler was installed, gr96 will contain a NULL pointer (gr96 = 0).
Signal handlers may perform any appropriate processing, but only the services with
service numbers above 256 are guaranteed to be available. Calls to services with
numbers below 256 may result in unpredictable behavior when returning to the
interrupted program, unless the service executes a longjmp(), which avoids execution
of the interrupt return service (see Table A-3).

To  un-register  a  signal  handler,  local  register  lr2  must  contain  a  value  of  0 
(NULL)  on  entry.  When  a  handler  is  un-registered  in  this  manner,  signal  handling 
will  revert  to  the  default  behavior  established  by  the  operating  system. 

When one of the signals (SIGINT or SIGFPE) occurs, the HIF implementation
must preserve: the signal number that occurred; the register stack pointer (gr1); the
register allocate bound (gr126); the program counters, PC0-PC2; the channel
registers (CHA, CHD, and CHC); the ALU register; the old processor status, OPS; and the
contents of gr121. These registers are saved on the user memory stack. The HIF
implementation must be careful not to disturb values in registers that have not been
saved on the user's stack. Global register gr125 should contain the address of the last
saved value in the HIF Signal Stack (e.g., gr121) at the conclusion of this phase.
Figure A-1 illustrates the required user stack format for saved registers.


Table A-2. Default Signals Handled by HIF

Mnemonic  Value  Description

SIGINT    2      User interrupt (e.g., from keyboard)
SIGFPE    8      Floating-point exception
                 Higher Addresses

                 +----------------+
                 |  User's Stack  |
                 +----------------+  --+
                 | signal number  |    |
                 | gr1            |    |
                 | gr126 (rab)    |    |
                 | PC0            |    |
                 | PC1            |    |
                 | PC2            |    |  Registers Saved by HIF
                 | CHA            |    |
                 | CHD            |    |
                 | CHC            |    |
                 | ALU            |    |
                 | OPS            |    |
    gr125 -----> | gr121 (tav)    |    |
                 +----------------+  --+

                 Lower Addresses

    gr125 points to the last register saved by the HIF in the user's stack.

Figure A-1. HIF Register Preservation for Signals
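
The frame in Figure A-1 can be described as twelve consecutive 32-bit words,
with gr125 addressing the lowest one (gr121). The following sketch names each
word by its index from gr125; the index order agrees with the SIGCTX_ offsets
defined in the Appendix B listing (TAV at 0, OPS at 1, CHC at 3, PC2 at 6,
up to the signal number at 11). The enumerator names and the accessor
hif_frame_word are hypothetical and are not part of any AMD library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Word index of each saved register, counted upward from gr125. */
enum hif_frame_word_index {
    HIF_FRAME_TAV = 0,   /* gr121 (tav); gr125 points here */
    HIF_FRAME_OPS = 1,   /* old processor status */
    HIF_FRAME_ALU = 2,
    HIF_FRAME_CHC = 3,
    HIF_FRAME_CHD = 4,
    HIF_FRAME_CHA = 5,
    HIF_FRAME_PC2 = 6,
    HIF_FRAME_PC1 = 7,
    HIF_FRAME_PC0 = 8,
    HIF_FRAME_RAB = 9,   /* gr126 (rab) */
    HIF_FRAME_GR1 = 10,  /* register stack pointer */
    HIF_FRAME_SIG = 11,  /* signal number, highest address */
    HIF_FRAME_WORDS = 12
};

/* Read one saved word, given the address held in gr125. */
static uint32_t hif_frame_word(const uint32_t *gr125, unsigned index)
{
    assert(gr125 != NULL);
    assert(index < HIF_FRAME_WORDS);
    return gr125[index];
}
```

A handler written in C could, under these assumptions, obtain the signal
number with hif_frame_word(frame, HIF_FRAME_SIG).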


It is at this point that the HIF invokes the handler specified by the
newsig parameter to the signal service. The handler is invoked with the processor
mode set to the mode of the interrupted program (either user or supervisor).
Depending on the nature of the interrupt (SIGINT or SIGFPE) and the complexity of the
handler, additional registers may need to be saved. In this case, the handler must
preserve: the values in the indirect pointers IPA, IPB, and IPC; the contents of the Q
register; the stack frame pointer, lr1; and the local register stack free bound in rfb
(gr127). In addition, because high-level languages use global registers gr96-gr124
as temporaries, the user signal handler may have to save these as well.

User  signal  handlers  can  be  grouped  into  three  levels  of  complexity,  depending 
on  the  implementation: 

1.  The  least  complex  are  handlers  which  have  no  intention  of  returning  control 
to  the  user.  In  this  case,  only  a  few  additional  registers  may  need  to  be  saved 
(if  any). 

2.  Floating-point error handlers are often more complex, where some of the
user's context needs to be saved. In this case, probably only the indirect
pointers (IPA-IPC), the Q register, and gr125 need be preserved. After the error
has been handled, the handler will issue one of the signal return services listed
in Table A-3 to return control to the user's program.

3.  The most complex handlers will be those that need to return to the user
program at the C-level of context. If the handler intends to pass control to a
user-provided signal routine (e.g., to handle SIGINT), then it may be necessary to
preserve all of the registers indicated in the figure above. In addition,
handlers that intend to return control at the C-level of context will need to make
provision for completing any interrupted SPILL or FILL operation, or for
completing a longjmp that may be in progress. Fortunately, AMD supplies the
necessary code in library routines supplied with most tool products (see
Appendix B).

Before execution of the signal handler, the HIF-conforming operating system is
responsible for clearing the Channel Control (CHC) register (setting it to 0), to
prevent restarting a load or store multiple operation that may have been interrupted. The
proper contents of this register will be restored by the HIF when the handler issues
one of the service requests listed in Table A-3.

Once a signal handler invoked by one of the signals listed in Table A-2 has
finished, it will usually return to the operating system by invoking one of
the signal return services shown in Table A-3, below, with register gr125 pointing to
the last saved register in the HIF-saved registers (i.e., gr121), as shown in
Figure A-1. More complex implementations may make other arrangements for
returning to the user program's context. Sample code for saving and restoring the necessary
registers is shown in section 4.4 and Appendix B.

The handler is responsible for determining the appropriate action for each type
of interrupt (SIGINT or SIGFPE), and must return control to the operating system
using one of the services listed in Table A-3, after first restoring the indirect pointers
(IPA-IPC) and the Q register, and with gr125 pointing to the last saved register in the
user's stack (assuming the suggested approach for preserving registers is followed).


Table A-3. HIF Signal Return Services

Service  Name    Description

322      sigdfl  Perform default signal handling
323      sigret  Return to location indicated by PC1
324      sigrep  Return to location indicated by PC2
325      sigskp  Return to location indicated by PC0
Register Usage

Type      Regs   Contents     Description

Calling:  gr121  321 (0x141)  Service number
          lr2    newsig       Address of signal handler, or NULL pointer

Returns:  gr96   oldsig       Old handler address
          gr121  0x80000000   Logical TRUE, service successful
                 errcode      Error number, service not successful
                              (implementation dependent)

Example Call

oldhdlr: .word   0
         const   lr2, user_sigs     ;address of user signal handler
         consth  lr2, user_sigs
         const   gr121, 321         ;service = 321
         asneq   69, gr1, gr1       ;call the OS to install handler
         jmpf    gr121, sig_err     ;jump to handle error
         const   gr120, oldhdlr     ;set up address to store old
         consth  gr120, oldhdlr     ;handler address
         store   0, 0, gr96, gr120  ;store the old handler address

In the example call, a user signal handler whose entry point name is user_sigs is
installed. When the service returns, if gr121 contains a FALSE value, the program
jumps to an error routine; otherwise, the address of the previously installed handler
returned in gr96 is stored in the local variable oldhdlr.
Service 322 - sigdfl

Perform Default Signal Action

Description

This service is called only from within a user signal handler installed using the
signal (321) service. The function of this service is to instruct the HIF to perform the
predetermined default action for the specified signal. The operating system is
responsible for establishing the appropriate default action.


Register Usage

Type      Regs   Contents     Description

Calling:  gr121  322 (0x142)  Service number
          gr125  sigptr       Pointer to HIF Signal Stack containing
                              preserved registers (see signal (321) for
                              further information)

Returns:  Does not return

Example Call

         const   gr121, 322         ;service = 322
         asneq   69, gr1, gr1       ;call the OS


Since  this  service  does  not  return,  no  attempt  is  made  to  test  returned  values  or 
store  results. 
Service 323 - sigret

Return From Signal Interrupt

Description

This service is called only from within a user signal handler installed using the
signal (321) service. The function of this service is to return from the latest signal
interrupt to the location specified by the value in program counter PC1 at the time the
signal occurred. Once invoked, this service does not return to the user signal handler.


Register Usage

Type      Regs   Contents     Description

Calling:  gr121  323 (0x143)  Service number
          gr125  sigptr       Pointer to HIF Signal Stack containing
                              preserved registers (see signal (321) for
                              further information)

Returns:  Does not return

Example Call

         const   gr121, 323         ;service = 323
         asneq   69, gr1, gr1       ;call the OS


Since  this  service  does  not  return,  no  attempt  is  made  to  test  returned  values  or 
store  results. 
Service 324 - sigrep

Return From Signal Interrupt

Description

This service is called only from within a user signal handler installed using the
signal (321) service. The function of this service is to return from the latest signal
interrupt to the location specified by the value in program counter PC2 at the time the
signal occurred. Once invoked, this service does not return to the user signal handler.


Register Usage

Type      Regs   Contents     Description

Calling:  gr121  324 (0x144)  Service number
          gr125  sigptr       Pointer to HIF Signal Stack containing
                              preserved registers (see signal (321) for
                              further information)

Returns:  Does not return

Example Call

         const   gr121, 324         ;service = 324
         asneq   69, gr1, gr1       ;call the OS


Since  this  service  does  not  return,  no  attempt  is  made  to  test  returned  values  or 
store  results. 
Service 325 - sigskp

Return From Signal Interrupt

Description

This service is called only from within a user signal handler installed using the
signal (321) service. The function of this service is to return from the latest signal
interrupt to the location specified by the value in program counter PC0 at the time the
signal occurred. Once invoked, this service does not return to the user signal handler.


Register Usage

Type      Regs   Contents     Description

Calling:  gr121  325 (0x145)  Service number
          gr125  sigptr       Pointer to HIF Signal Stack containing
                              preserved registers (see signal (321) for
                              further information)

Returns:  Does not return

Example Call

         const   gr121, 325         ;service = 325
         asneq   69, gr1, gr1       ;call the OS


Since  this  service  does  not  return,  no  attempt  is  made  to  test  returned  values  or 
store  results. 
Service 326 - sendsig

Send Signal

Description

This service provides the means to send a signal to the current process, to
support signal testing. A single parameter, sig, specifies the signal number to be sent.


Register Usage

Type      Regs   Contents     Description

Calling:  gr121  326 (0x146)  Service number
          lr2    sig          Signal number to be sent to current process

Returns:  gr121  0x80000000   Logical TRUE, service successful
                 errcode      Error number, service not successful
                              (implementation dependent);
                              EHIFNOTAVAIL if service not implemented


Example Call

         const   lr2, SIGFPE        ;floating-point exception
         const   gr121, 326         ;service = 326
         asneq   69, gr1, gr1       ;call the OS
         jmpf    gr121, send_err    ;handle signaling error
         nop

In  the  above  example,  a  floating-point  exception  error  signal  is  being  sent  to  the 
current  process.  It  is  presumed  that  a  signal  handler  for  the  SIGFPE  (floating-point 
exception)  error  has  been  previously  installed  (see  signal  service)  and  is  being  tested. 
A.2  Error Numbers

HIF implementations are required to return error codes when a requested
operation is not possible. The codes from 0-10,000 are reserved for compatibility with
current and future error return standards. The currently assigned codes and their
meanings are shown in Tables A-4 through A-9. If a HIF implementation returns an
error code in the range of 0-10,000, it must carry the identical meaning to the
corresponding error code in these tables. Error code values larger than 10,000 are available
for implementation-specific errors.
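
The convention used throughout the service descriptions (gr121 is TRUE on
success, otherwise an error number) together with the ranges above can be
captured in a few lines. The sketch below is illustrative only; the function
and type names are hypothetical. It treats a value with the most significant
bit set as TRUE, which is the same test the jmpf instruction applies in the
example calls, and therefore assumes error numbers never have that bit set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HIF_TRUE         0x80000000u  /* value returned in gr121 on success */
#define HIF_RESERVED_MAX 10000u       /* codes 0-10,000 are reserved */
#define EHIFNOTAVAIL     1001u        /* from Table A-9 */
#define EHIFUNDEF        1002u        /* from Table A-9 */

typedef enum {
    HIF_ERR_NONE,            /* service succeeded */
    HIF_ERR_STANDARD,        /* code in the reserved 0-10,000 range */
    HIF_ERR_IMPLEMENTATION   /* code above 10,000 */
} hif_err_class;

/* TRUE is any value with bit 31 set, matching the jmpf test. */
static bool hif_succeeded(uint32_t gr121)
{
    return (gr121 & HIF_TRUE) != 0;
}

/* Classify the value a service left in gr121. */
static hif_err_class hif_classify(uint32_t gr121)
{
    if (hif_succeeded(gr121))
        return HIF_ERR_NONE;
    return gr121 <= HIF_RESERVED_MAX ? HIF_ERR_STANDARD
                                     : HIF_ERR_IMPLEMENTATION;
}
```

Under these assumptions, EHIFNOTAVAIL (1001) falls in the standard range,
while a code such as 10,001 would be treated as implementation specific.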


Table A-4. HIF Error Numbers Assigned

Number  Name  Description

0  Not used.

1  EPERM  Not owner

This error indicates an attempt to modify a file in some way forbidden
except to its owner.

2  ENOENT  No such file or directory

This error occurs when a file name is specified and the file should exist but
does not, or when one of the directories in a pathname does not exist.

3  ESRCH  No such process

The process or process group whose number was given does not exist, or
any such process is already dead.

4  EINTR  Interrupted system call

This error indicates that an asynchronous signal (such as interrupt or quit)
that the user has elected to catch occurred during a system call.

5  EIO  I/O error

Some physical I/O error occurred during a read or write. This error may in
some cases occur on a call following the one to which it actually applies.

6  ENXIO  No such device or address

I/O on a special file refers to a sub-device that does not exist or is beyond
the limits of the device.

7  E2BIG  Arg list is too long

An argument list longer than 5120 bytes is presented to executive.

8  ENOEXEC  Exec format error

A request is made to execute a file that, although it has the appropriate
permissions, does not start with a valid magic number.

9  EBADF  Bad file number

Either a file descriptor refers to no open file, or a read (write) request is
made to a file that is open only for writing (reading).
Table A-5. HIF Error Numbers Assigned (continued)

Number  Name  Description

10  ECHILD  No children

Wait and the process has no living or unwaited-for children.

11  EAGAIN  No more processes

In a fork, the system's process table is full, or the user is not allowed to
create any more processes.

12  ENOMEM  Not enough memory

During an executive or break, a program asks for more memory than the
system is able to supply or else a process size limit would be exceeded.

13  EACCESS  Permission denied

An attempt was made to access a file in a way forbidden by the protection
system.

14  EFAULT  Bad address

The system encountered a hardware fault in attempting to access the
arguments of a system call.

15  ENOTBLK  Block device required

A plain file was mentioned where a block device was required, such as in
mount.

16  EBUSY  Device busy

An attempt was made to mount a device that was already mounted, or an
attempt was made to dismount a device on which there is an active file
(open file, current directory, mounted-on file, or active text segment).

17  EEXIST  File exists

An existing file was mentioned in an inappropriate context (e.g., link).

18  EXDEV  Cross-device link

A hard link to a file on another device was attempted.

19  ENODEV  No such device

An attempt was made to apply an inappropriate system call to a device, for
example, to read a write-only device, or the device is not configured by the
system.

20  ENOTDIR  Not a directory

A non-directory was specified where a directory is required, for example, in
a path name or as an argument to chdir.

21  EISDIR  Is a directory

An attempt to write on a directory.
Table A-6. HIF Error Numbers Assigned (continued)

Number  Name  Description

22  EINVAL  Invalid argument

This error occurs when some invalid argument for the call is specified. For
example, dismounting a non-mounted device, mentioning an unknown
signal in signal, or specifying some other argument that is inappropriate for
the call.

23  ENFILE  File table overflow

The system's table of open files is full, and temporarily no more open
requests can be accepted.

24  EMFILE  Too many open files

The configuration limit on the number of simultaneously open files has
been exceeded.

25  ENOTTY  Not a typewriter

The file mentioned in stty or gtty is not a terminal or one of the other
devices to which these calls apply.

26  ETXTBSY  Text file busy

The referenced text file is busy and the current request cannot be honored.

27  EFBIG  File too large

The size of a file exceeded the maximum limit.

28  ENOSPC  No space left on device

A write to an ordinary file, the creation of a directory or symbolic link, or the
creation of a directory entry failed because no more disk blocks are
available on the file system.

29  ESPIPE  Illegal seek

A seek was issued to a socket or pipe. This error may also be issued for
other non-seekable devices.

30  EROFS  Read-only file system

An attempt to modify a file or directory was made on a device mounted
read-only.

31  EMLINK  Too many links

An attempt was made to establish a new link to the requested file and the
limit of simultaneous links has been exceeded.

32  EPIPE  Broken pipe

A write on a pipe or socket was attempted for which there is no process to
read the data. This condition normally generates a signal; the error is
returned if the signal is caught or ignored.

33  EDOM  Argument too large

The argument of a function in the math package is out of the domain of the
function.
Table A-7. HIF Error Numbers Assigned (continued)

Number  Name  Description 

34  ERANGE  Result  too  large 

The  value  of  a  function  in  the  math  package  is  unrepresentable  within 
machine  precision. 

35  EWOULDBLOCK  Operation  would  block 

An  operation  that  would  cause  a  process  to  block  was  attempted  on  an 
object  in  non-blocking  mode. 

36  EINPROGRESS  Operation  now  in  progress 

An  operation  that  takes  a  long  time  to  complete  was  attempted  on  a  non- 
blocking  object. 

37  EALREADY  Operation  already  in  progress 

An  operation  was  attempted  on  a  non-blocking  object  that  already  had  an 
operation  in  progress. 

38  ENOTSOCK  Socket-operation  on  non-socket 

A  socket  oriented  operation  was  attempted  on  a  non-socket  device. 

39  EDESTADDRREQ 

Destination  address  required 

A  required  address  was  omitted  from  an  operation  on  a  socket. 

40  EMSGSIZE  Message  too  long 

A  message  sent  on  a  socket  was  larger  than  the  internal  message  buffer  or 
some  other  network  limit. 

41  EPROTOTYPE  Protocol  wrong  type  for  socket 

A  protocol  was  specified  that  does  not  support  the  semantics  of  the  socket 
type  requested. 

42  ENOPROTOOPT  Option  not  supported  by  protocol 

A  bad  option  or  level  was  specified  when  accessing  socket  options. 

43  EPROTONOSUPPORT 

Protocol  not  supported 

The  protocol  has  not  been  configured  into  the  system,  or  no 
implementation  for  it  exists. 

44  ESOCKTNOSUPPORT 

Socket  type  not  supported 

The  support  for  the  socket  type  has  not  been  configured  into  the  system,  or 
no  implementation  for  it  exists. 

45  EOPNOTSUPP  Operation  not  supported  on  socket 

For  example,  trying  to  accept  a  connection  on  a  datagram  socket. 

46  EPFNOSUPPORT 

Protocol  family  not  supported 

The  protocol  family  has  not  been  configured  into  the  system  or  no 
implementation  for  it  exists. 
Table A-8. HIF Error Numbers Assigned (continued)

Number  Name  Description 

47  EAFNOSUPPORT 

Address  family  not  supported  by  protocol  family 

An  address  was  used  that  is  incompatible  with  the  requested  protocol. 

48  EADDRINUSE  Address  already  in  use 

Only  one  usage  of  each  address  is  normally  permitted. 

49  EADDRNOTAVAIL 

Cannot  assign  requested  address 

This  normally  results  from  an  attempt  to  create  a  socket  with  an  address 
not  on  this  machine. 

50  ENETDOWN  Network  is  down 

A  socket  operation  encountered  a  dead  network. 

51  ENETUNREACH  Network  is  unreachable 

A  socket  operation  was  attempted  to  an  unreachable  network. 

52  ENETRESET  Network  dropped  connection  on  reset 

The  host  you  were  connected  to  crashed  and  rebooted. 

53  ECONNABORTED 

Software  caused  connection  abort 

A  connection  abort  was  caused  internal  to  your  host  machine. 

54  ECONNRESET  Connection  reset  by  peer 

A  connection  was  forcibly  closed  by  a  peer.  This  normally  results  from  a 
loss  of  the  connection  on  the  remote  socket  due  to  a  timeout  or  a  reboot. 

55  ENOBUFS  No  buffer  space  available 

An  operation  on  a  socket  or  pipe  was  not  performed  because  the  system 
lacked  sufficient  buffer  space  or  because  a  queue  was  full. 

56  EISCONN  Socket  is  already  connected 

A  connect  request  was  made  on  an  already  connected  socket;  or  a  sendto 
or  sendmsg  request  on  a  connected  socket  specified  a  destination  when 
already  connected. 

57  ENOTCONN  Socket  is  not  connected 

A  request  to  send  or  receive  data  was  disallowed  because  the  socket  was 
not  connected  and  (when  sending  on  a  datagram  socket)  no  address  was 
supplied. 

58  ESHUTDOWN  Cannot  send  after  socket  shutdown 

A  request  to  send  data  was  disallowed  because  the  socket  had  already 
been  shut  down  with  a  previous  shutdown  call. 

59  ETOOMANYREFS 

Too  many  references;  cannot  splice. 
Table A-9. HIF Error Numbers Assigned (concluded)

Number  Name  Description 

60  ETIMEDOUT  Connection  timed  out 

A connect or send request failed because the connected party did not
properly respond after a period of time. (The timeout period is dependent
on the communication protocol.)

61  ECONNREFUSED 

Connection  refused 

No  connection  could  be  made  because  the  target  machine  actively 
refused  it.  This  usually  results  from  trying  to  connect  to  a  service  that  is 
inactive  on  the  foreign  host. 

62  ELOOP  Too  many  levels  of  symbolic  links 

A  pathname  lookup  involved  more  than  the  maximum  limit  of  symbolic 
links. 

63  ENAMETOOLONG 

File  name  too  long 

A  component  of  a  pathname  exceeded  the  maximum  name  length,  or  an 
entire  pathname  exceeded  the  maximum  path  length. 

64  EHOSTDOWN  Host  is  down 

A  socket  operation  failed  because  the  destination  host  was  down. 

65  EHOSTUNREACH

Host  is  unreachable 

A  socket  operation  was  attempted  to  an  unreachable  host. 

66  ENOTEMPTY  Directory  not  empty 

A  non-empty  directory  was  supplied  to  a  remove  directory  or  rename  call. 

67  EPROCLIM  Too  many  processes 

The  limit  of  the  total  number  of  processes  has  been  reached.  No  new 
processes  can  be  created. 

68  EUSERS  Too  many  users 

The  limit  of  the  total  number  of  users  has  been  reached.  No  new  users  may 
access  the  system. 

69  EDQUOT  Disk  quota  exceeded 

A  write  to  an  ordinary  file,  the  creation  of  a  directory  or  symbolic  link,  or  the 
creation  of  a  directory  entry  failed  because  the  user’s  quota  of  disk  blocks 
was  exhausted;  or  the  allocation  of  an  inode  for  a  newly  created  file  failed 
because  the  user's  quota  of  inodes  was  exhausted. 

70  EVDBAD  RVD  related  disk  error 

1001  EHIFNOTAVAIL  HIF  service  not  available. 

The  requested  HIF  service  is  not  implemented  or  is  not  available  to  the 
user  program  making  the  request. 

1002  EHIFUNDEF  HIF  service  is  undefined 

The  HIF  service  referenced  by  the  program  is  undefined.  No  valid  HIF 
service  with  that  service  number  exists. 
HIF Signal Processing


B.1  User Trampoline Code

The following 29K assembly language code is a listing of the User mode code
required to complete signal handler preparation tasks necessary for an HIF-compliant
operating system. When address sigcode is reached, the operating system has already
saved the portion of the interrupt context frame down to the tav register onto the
user's memory stack (see section 4.4). This source code is contained in library files
delivered with the High C 29K product, and is listed here for reference and to aid
comprehension of the way different code fragments fit together in a real
implementation.

Implementors who do not wish to use an array of signal handlers, but a single
C-level handler function, can change the code in the call C-level section. They need
not access the SigEntry array to obtain the address of their handler function.
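
For readers more familiar with C, the role of the SigEntry array can be
pictured as a table of handler addresses with one slot per signal. The sketch
below is only an illustration of that idea and is not the library source: the
names sig_entry, sig_install and sig_dispatch are hypothetical, and the
indexing (signal n in slot n - 1) is an assumption inferred from the layout of
the table in the listing, where the second word is labelled for signal #2
(SIGINT) and the eighth for signal #8 (SIGFPE).

```c
#include <assert.h>
#include <stddef.h>

#define SIGINT_NUM   2   /* value from Table A-2 */
#define SIGFPE_NUM   8   /* value from Table A-2 */
#define NUM_SIGNALS 32   /* number of words in the table (assumed) */

typedef void (*sig_handler)(int sig);

/* One slot per signal; a NULL slot means no handler is installed. */
static sig_handler sig_entry[NUM_SIGNALS];

/* Install a handler and return the previous one (NULL if none). */
static sig_handler sig_install(int sig, sig_handler handler)
{
    assert(sig >= 1 && sig <= NUM_SIGNALS);
    sig_handler old = sig_entry[sig - 1];
    sig_entry[sig - 1] = handler;
    return old;
}

/* Call the handler for sig; returns 1 if one was called, else 0. */
static int sig_dispatch(int sig)
{
    if (sig < 1 || sig > NUM_SIGNALS || sig_entry[sig - 1] == NULL)
        return 0;
    sig_entry[sig - 1](sig);
    return 1;
}

/* Example handler: records the last signal number delivered. */
static int last_signal;

static void record_signal(int sig)
{
    last_signal = sig;
}
```

An implementation using a single C-level handler, as described above, would
replace the table lookup in sig_dispatch with a direct call.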

        .file   "signal.s"

; SigEntry is the address of an array of C-level user code signal
; handlers. They must return to the top-level before doing a
; sigret() return function. Nested signals are supported.

        .extern V_SPILL, V_FILL
        .extern fill_handler        ;In crt0.s

        .align  4
        .comm   WindowSize, 4
        .data
.global

_ SigEntry 

_ SigEntry 

SigEntry : 

        .word   0                   ;reserved
        .word   0                   ;address of #2 SIGINT handler
        .word   0                   ;reserved
        .word   0                   ;reserved
        .word   0                   ;reserved
        .word   0                   ;reserved
        .word   0                   ;reserved
        .word   0                   ;address of #8 SIGFPE handler
        .word   0                   ;reserved
        .word   0                   ;reserved
        .word   0                   ;reserved
        .word   0                   ;reserved
        .word   0                   ;reserved
        .word   0                   ;reserved #14 SIGALARM
        .rep    29-14
        .word   0                   ;reserved
        .endr
        .word   0                   ;reserved #30 SIGUSR1
        .word   0                   ;reserved #31 SIGUSR2
        .word   0

; ================================================= Macros
;
        .macro  push, sp, reg
        sub     sp, sp, 4           ;decrement pointer
        store   0, 0, reg, sp       ;store on stack
        .endm
;
        .macro  pushsr, sp, reg, sreg
        mfsr    reg, sreg           ;copy from special register
        sub     sp, sp, 4           ;decrement pointer
        store   0, 0, reg, sp       ;store on stack
        .endm
;
        .macro  pop, reg, sp
        load    0, 0, reg, sp       ;get from stack
        add     sp, sp, 4           ;increment pointer
        .endm
;
        .macro  popsr, sreg, reg, sp
        load    0, 0, reg, sp       ;get from stack
        add     sp, sp, 4           ;increment pointer
        mtsr    sreg, reg           ;move to special register
        .endm

        .reg    v0, gr96
        .reg    v1, gr97
        .reg    v2, gr98
        .reg    v3, gr99


; ================================================= Process Signal
;
; About to deliver a signal to a user mode signal handler.
; The state of all the registers (except for msp, chc and rab)
; is the same as when the process was interrupted.
;
; We must make the stack and window consistent before calling the
; handler. The original rab value is on the stack. The interrupt
; handler placed rfb-WindowSize in rab. This is required to
; support nested interrupts.
;
; Note that the window becomes inconsistent only during certain
; critical sections in spill, fill, longjmp and sigcode.
;       rfb - rab > windowsize => we are in spill
;       rfb - rab < windowsize => we are in fill
;       gr1 + 8   > rfb        => we are in long-longjmp case
; In case of spill, fill and longjmp; rab is modified first,
; so if we are in one of these critical sections,
; we set rab to rfb - WINDOWSIZE.
;
        .equ    SIGCTX_UM_SIZE, 39*4
        .equ    SIGCTX_RFB, (37)*4                      ; User-Mode saved
;
        .equ    SIGCTX_SM_SIZE, 12*4                    ; Supervisor-Mode saved
        .equ    SIGCTX_SIG, (11)*4 + SIGCTX_UM_SIZE
        .equ    SIGCTX_GR1, (10)*4 + SIGCTX_UM_SIZE
        .equ    SIGCTX_RAB, (9)*4 + SIGCTX_UM_SIZE
        .equ    SIGCTX_PC0, (8)*4 + SIGCTX_UM_SIZE
        .equ    SIGCTX_PC1, (7)*4 + SIGCTX_UM_SIZE
        .equ    SIGCTX_PC2, (6)*4 + SIGCTX_UM_SIZE
        .equ    SIGCTX_CHC, (3)*4 + SIGCTX_UM_SIZE
        .equ    SIGCTX_OPS, (1)*4 + SIGCTX_UM_SIZE
        .equ    SIGCTX_TAV, (0)*4 + SIGCTX_UM_SIZE


        .global sigcode
; --------------------------------------------------------- sigcode
sigcode:
        push    msp, lr1                ; push R-stack support
        push    msp, rfb
        push    msp, msp                ; M-stack support
        sub     msp, msp, 3*4           ; space for Floating Point
;
        pushsr  msp, tav, IPA           ; User mode specials
        pushsr  msp, tav, IPB
        pushsr  msp, tav, IPC
        pushsr  msp, tav, Q
;
        sub     msp, msp, 29*4          ; push gr96-gr124
        mtsrim  cr, 29-1
        storem  0, 0, gr96, msp


; --------------------------------------------------- R-Stack fixup
R_fixup:
        const   v0, WindowSize          ; get register cache size
        consth  v0, WindowSize
        load    0, 0, v0, v0
        add     v2, msp, SIGCTX_RAB
        load    0, 0, v2, v2            ; get interrupted rab value
        sub     v1, rfb, v2             ; determine rfb-rab<=WINDOW_SIZE
        cpgeu   v1, v1, v0              ;
        jmpt    v1, nfill               ; jmp if spill or normal interrupt
        add     v1, gr1, 8
        cpgtu   v1, v1, rfb             ; interrupted longjmp can look like
        jmpt    v1, nfill               ; fill, test for long-longjmp
        nop                             ; interruption, jmp if gr1 + 8 > rfb

; Fixup signal stack to re-start interrupted fill
; backup pc1 -- this is needed for the partial fill case.
; Clear chc so an interrupted load/store does not restart.
; Reset rab to a window distance below rfb, rab shall be
; decremented again on re-starting the interrupted fill.
; The interrupt handler set rab=rfb-WindowSize.
;
ifill:
        add     v0, msp, SIGCTX_RAB + 4
        push    v0, rab                 ; resave rab=rfb-WindowSize
        const   v2, fill + 4
        consth  v2, fill + 4
        push    v0, v2                  ; resave PC0
        sub     v2, v2, 4
        push    v0, v2                  ; resave PC1
        const   v2, 0
        sub     v0, v0, 3*4             ; point to CHC
        push    v0, v2                  ; resave CHC=0
;

nfill:
        cpgtu   v0, gr1, rfb            ; if gr1 > rfb then gr1 = rfb
        jmpt    v0, lower
        cpltu   v0, gr1, rab            ; if gr1 < rab then gr1 = rab
        jmpt    v0, raise
        nop
;
sendsig:
        .equ    RALLOC, 4*4             ; make space for function calls
        sub     gr1, gr1, RALLOC
        asgeu   V_SPILL, gr1, rab
        add     lr1, rfb, 0             ; set lr1 = rfb
        add     v1, msp, SIGCTX_SIG
        load    0, 0, lr2, v1           ; restore signal number
        sub     v1, lr2, 1              ; get handler index
        sll     v1, v1, 2               ; point to addresses

; ---------------------------------------------------- call C-level
; Handler must not use HIF services other than the _sigret() type.
        const   v0, SigEntry
        consth  v0, SigEntry
        add     v0, v0, v1
        load    0, 0, v0, v0            ; determine if handler registered
        cpeq    v1, v0, 0
        jmpt    v1, NoHandler
        nop
        calli   lr0, v0                 ; call C-level signal handler
        nop
;
; -------------------------------------------------- default return
NoHandler:
        jmp     __sigdfl
        nop

raise:  jmp     sendsig
        add     gr1, rab, 0
lower:  jmp     sendsig
        add     gr1, rfb, 0


; ==================================================== support code
; ----------------------------------------------------- repair_regs
        .macro  repair_regs
        mtsrim  cr, 29-1                ; restore gr96-gr124
        loadm   0, 0, gr96, msp
        add     msp, msp, 29*4
;
        popsr   Q, tav, msp             ; restore special registers
        popsr   IPC, tav, msp
        popsr   IPB, tav, msp
        popsr   IPA, tav, msp
;
        add     msp, msp, 3*4           ; space for Floating Point
        add     msp, msp, 2*4           ; R-stack already repaired
        pop     lr1, msp                ; repair msp to User mode
                                        ; signal entry value
        .endm
; -------------------------------------------------- repair_R_stack
; The handler function may request a signal return OS
; service, therefore avoiding the FILL in the prologue of
; the handler function. The code below replaces the possibly
; omitted prologue, required to return the register stack to
; the position at which it was interrupted ('gr1 is the
; interrupted register stack pointer.)
;
; A single FILL may not be able to restore the complete stack
; as a FILL can only restore 126 registers.
;
; if 'gr1 < 'rfb-WindowSize
;       Yes  LB = 'rfb-WindowSize, (signaled during a SPILL)
;       No   LB = 'gr1, (interrupted cache < 126 regs)
;
; if LB =< rfb   Yes  FILL from rfb to 'rfb (partial cache FILL)
;                No   FILL from LB to 'rfb (complete cache FILL)
;
; If this *FILL* were interrupted we have no means of restarting
; it like an ordinary FILL. So we make sure no registers are
; damaged by an interrupt by setting gr1=rab for the duration
; of the *FILL*. This marks the cache as fully in use. A SPILL
; would be generated by an interrupt of the *FILL*.
;
        .macro  repair_R_stack
        add     gr96, msp, SIGCTX_GR1
        load    0, 0, gr98, gr96        ; gr98 = interrupted 'gr1
        add     gr96, msp, SIGCTX_RFB
        load    0, 0, gr99, gr96        ; gr99 = interrupted 'rfb
;
; if 'gr1 < 'rfb-WindowSize
;       Yes  LB = 'rfb-WindowSize, (signaled during a SPILL)
;       No   LB = 'gr1, (interrupted cache < 126 regs)
        const   gr97, 512
        sub     gr97, gr99, gr97        ; 'rfb-WindowSize
        cpltu   gr96, gr98, gr97        ; test 'gr1 < 'rfb-WindowSize
        jmpf    gr96, $1                ; initialise LB='gr1
        add     gr1, rab, 0             ; mark cache all in use
        add     gr98, gr97, 0           ; set LB='rfb-WindowSize
$1:                                     ; LB in gr98
; if LB =< rfb   Yes  FILL from rfb to 'rfb (partial cache FILL)
;                No   FILL from LB to 'rfb (complete cache FILL)
        cpleu   gr96, gr98, rfb         ; test LB =< rfb
        jmpf    gr96, $2
        nop                             ; default complete fill LB->'rfb
        add     gr98, rfb, 0            ; fill rfb->'rfb (partial)
$2:                                     ; lower fill adds. (LA) in gr98
        cpeq    gr96, gr99, rfb         ; test if 'rfb==rfb, jump if
        jmpt    gr96, $3                ; partial fill is zero in size
        const   tav, (0x80<<2)          ; prepare for *FILL*
        or      tav, tav, gr98
        mtsr    IPA, tav                ; ipa = LA<<2
        sub     tav, gr99, gr98         ; cache fill LA->'rfb
        srl     tav, tav, 2             ; convert to words
        sub     tav, tav, 1
        mtsr    cr, tav
        loadm   0, 0, gr0, gr98         ; fill from LA->'rfb
$3:
        add     rfb, gr99, 0            ; move rfb up to 'rfb
        sub     rab, gr97, 0            ; assign rab to 'rfb-WindowSize
        add     gr96, msp, SIGCTX_GR1
        load    0, 0, gr98, gr96        ; gr98 = interrupted 'gr1
        add     gr1, gr98, 0            ; move gr1 up to 'gr1
        nop
        .endm
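
The bound selection made by the repair_R_stack macro can be followed more easily in
a small C model. This is only a sketch and is not part of the library: the names
plan_fill and struct fill_plan are hypothetical, register contents are treated as
plain 32-bit byte addresses, and the 512-byte window is the constant loaded into
gr97 by the macro.

```c
#include <stdint.h>

#define WINDOW_BYTES 512u   /* same value as "const gr97, 512" in the macro */

/* Hypothetical result type: where the reload starts and how much it moves. */
struct fill_plan {
    uint32_t la;     /* lower fill address (LA), held in gr98 by the macro */
    uint32_t words;  /* words to reload, 0 when no *FILL* is needed        */
};

/* saved_gr1 and saved_rfb are the interrupted 'gr1 and 'rfb values taken
 * from the signal frame; rfb is the current register free bound. */
struct fill_plan plan_fill(uint32_t saved_gr1, uint32_t saved_rfb, uint32_t rfb)
{
    struct fill_plan p;
    uint32_t lb = saved_gr1;                    /* default: LB = 'gr1 */

    if (saved_gr1 < saved_rfb - WINDOW_BYTES)   /* signaled during a SPILL */
        lb = saved_rfb - WINDOW_BYTES;

    /* LB =< rfb: only the part above rfb is missing (partial fill);
     * otherwise the whole range LB..'rfb must be reloaded. */
    p.la = (lb <= rfb) ? rfb : lb;

    /* 'rfb == rfb means the partial fill is zero in size. */
    p.words = (saved_rfb == rfb) ? 0u : (saved_rfb - p.la) / 4u;
    return p;
}
```

In the macro the word count, less one, is what is written to the CR register
before the loadm instruction is issued.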

B.2  Library  Glue  Routines  to  HIF  Signal  Services 




The following five assembly level routines are used by C language application
programs to request HIF services supporting signals. The first four services are used
to cause signal handler termination. Except for _sigdfl(), the difference between
these services is in how the processor PC buffer registers will be restored. It is
possible to restart an instruction or skip the instruction being executed at the time
signal processing started (see Appendix A). The _sigdfl() routine is used to request
the default HIF return service. The action taken is operating system implementation
dependent. The _sendsig() service can be used to request a signal be sent to the HIF
application. This is useful, as it enables a software generated signal to test out the
signal handling system.


; ======================================================= _sigret()
        .global __sigret
__sigret:
        repair_R_stack
        repair_regs
        const   tav, 323                ; HIF _sigret
        asneq   69, gr1, gr1
        halt                            ; commit suicide if returns

; ======================================================= _sigdfl()
        .global __sigdfl
__sigdfl:
        repair_R_stack
        repair_regs
        const   tav, 322                ; HIF _sigdfl
        asneq   69, gr1, gr1
        halt                            ; commit suicide if returns

; ======================================================= _sigrep()
        .global __sigrep
__sigrep:
        repair_R_stack
        repair_regs
        const   tav, 324                ; HIF _sigrep
        asneq   69, gr1, gr1
        halt                            ; commit suicide if returns

; ======================================================= _sigskp()
        .global __sigskp
__sigskp:
        repair_R_stack
        repair_regs
        const   tav, 325                ; HIF _sigskp
        asneq   69, gr1, gr1
        halt                            ; commit suicide if returns

; ====================================================== _sendsig()
; lr2 = signal number
        .global _raise
        .global __sendsig
_raise:
__sendsig:
        const   tav, 326                ; HIF sendsig
        asneq   69, gr1, gr1
        jmpi    lr0
        nop

B.3  The  Library  signal()  Routine  for  Registering  a  Handler 

The following signal() routine is part of the HIF signal support library. It is
used to enter the address of a handler routine which is called when the indicated
signal occurs. Handler addresses are stored in an array, indexed by signal number. The
signal trampoline code (see section B.1) looks up the table when a signal occurs and
calls the registered handler. The second routine, _signal(), is the assembly language
glue routine used to request the HIF signal service.


; ======================================================== signal()
; signal(sig_number, handler);
;
; lr2 = signal number
; lr3 = handler address
        .global _signal
_signal:
; the memory variable WindowSize must be initialised at the
; start when rfb and rab are a window size apart.
        const   v0, WindowSize          ; get register cache size
        consth  v0, WindowSize
        load    0, 0, v1, v0
        cpeq    v1, v1, 0
        jmpf    v1, WindowSizeOK
        sub     v1, rfb, rab            ; rfb-rab = WINDOW_SIZE
        store   0, 0, v1, v0
WindowSizeOK:
        const   v1, SigEntry
        consth  v1, SigEntry
        sub     v3, lr2, 1              ; get handler index
        sll     v3, v3, 2               ; pointer to addresses
        add     v1, v1, v3
        store   0, 0, lr3, v1           ; save new handler
        const   lr2, sigcode
        consth  lr2, sigcode
                                        ; Fall through to __signal
; ======================================================= _signal()
        .global __signal
__signal:
        const   gr96, RegSigHand        ; User mode accessible copy
        consth  gr96, RegSigHand        ; of handler address
        store   0, 0, lr2, gr96
        const   tav, 321                ; HIF signal
        asneq   69, gr1, gr1
        jmpi    lr0
        nop

        .data
        .global RegSigHand
RegSigHand:
        .word   0
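
The table handling performed by signal() and by the trampoline in section B.1 can be
modelled in C as follows. This is a sketch under stated assumptions, not the library
source: the names model_signal, model_deliver, record_signal and last_signal are
hypothetical, and the range checks are an addition which the assembly code does not
make. The table size of 32 words matches the SigEntry array in the listing.

```c
#include <stddef.h>

#define NSIGNALS 32                      /* words reserved at SigEntry */

typedef void (*sig_handler_t)(int);

sig_handler_t sig_entry[NSIGNALS];       /* stands in for the SigEntry array */
int last_signal;                         /* written by the sample handler    */

/* Model of signal(): the handler for signal n lives at index n-1. */
int model_signal(int sig, sig_handler_t handler)
{
    if (sig < 1 || sig > NSIGNALS)       /* check added for the model only */
        return -1;
    sig_entry[sig - 1] = handler;
    return 0;
}

/* Model of the trampoline lookup: returns 1 if a registered handler was
 * called, 0 if the default path (the jump to _sigdfl) would be taken. */
int model_deliver(int sig)
{
    sig_handler_t h;

    if (sig < 1 || sig > NSIGNALS)
        return 0;
    h = sig_entry[sig - 1];
    if (h == NULL)
        return 0;
    h(sig);
    return 1;
}

/* Sample handler used to exercise the model. */
void record_signal(int sig)
{
    last_signal = sig;
}
```

Registering record_signal for signal 2 (SIGINT) and then calling model_deliver(2)
follows the same index arithmetic as the sub and sll instructions in the listings.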
Appendix C


Software  Assigned  Trap  Numbers 


The 29K processor hardware assigns tasks to most traps below number 64.
Debuggers and operating systems assign tasks to trap numbers 64 and above. Over time,
a number of the available trap numbers have been utilized by various tools and
products. Table C-1 below lists the widely known trap number assignments. Of course it
is possible that a company may make changes to a product in this area.

When a trap number is required for a new operating system or support service, it
is best to avoid current trap number usage. This may enable virtualizing the new
service on top of existing operating systems. For example, the HIF system call number
(69) is different from the 4.3bsd UNIX system call number (66); this enables the
UNIX operating system to distinguish HIF service calls and support them along with
native UNIX system calls.

Table C-1.  Software Assigned Trap Numbers

Trap Name               Number    Description

V_BKPT                  0         MiniMON29K instruction breakpoint
                                  trap, processor illegal opcode trap.
V_TRACE                 15        Processor trace trap defined by
                                  hardware.
V_SPILL                 64        Spill and fill support for high level
V_FILL                  65        language calling convention.
V_BSDCALL               66        4.3bsd UNIX system call.
V_SYSVCALL              67        System V UNIX system call.
V_MINIXCALL             68        Minix system call.
V_HIFCALL               69        HIF system call.
V_BRKCALL               70        BitBlocks Inc. realtime kernel support.
V_TRAP_71               71        DebugCore 2.0
V_KSPILL                72        Reserved for Supervisor mode
V_KFILL                 73        spill and fill support.
V_DELAYED_TIMER         74        JMI Inc. C EXECUTIVE support.
V_DBG_MSG               75        MiniMON29K debug message trap.
V_OS_MSG                76        MiniMON29K OS message trap.
V_GDB_BKPT              77        GDB breakpoint.
ADA_RTS_TRAP            80        Ada run-time system call
                                  (function code in gr116).
ADA_TDM_TRAP            81        Ada Target Debug Monitor (TDM)
                                  service request.
ADA_RAISE_TRAP          82        Used to raise an exception
                                  (raise code is in gr90).
ADA_CONSTRAINT_TRAP     83        Raise constraint error.
ADA_NUMERIC_TRAP        84        Raise numeric error.
ADA_PROGRAM_TRAP        85        Raise program error.
ADA_STORAGE_TRAP        86        Raise storage error.
ADA_TASKING_TRAP        87        Raise tasking error.
V_SVSCTRAP              80        Multiprocessor Toolsmiths Inc.
V_IOTRAP                81        pSOS operating system support.
V_IRETTRAP              82
V_DEBUG                 83
V_AMD_BIOS              88-89     AMD Laser printer support software.
V_PCL5_TASKER           90-93     PCL5 operating system.
V_OSE                   94-97     OSE RTOS system support.
V_OSBOOT                98        Export OS-boot information.
V_EMUALTE               140-159   Instruction emulation support.
V_INTR3_SUPPORT         220-252   Used for INTR3 overload support.
V_RESET                 255       Used by MiniMON29K to indicate
                                  processor reset.
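
When choosing a trap number for a new service, the assignments in the table can be
checked mechanically. The following C sketch is illustrative only: the names
assigned_traps and trap_is_unassigned are hypothetical, and the ranges are simply
transcribed from the table, which lists widely known assignments and may not be
complete for any particular product.

```c
#include <stddef.h>

struct trap_range {
    int first;
    int last;
};

/* Trap numbers already in use, transcribed from the table above. */
static const struct trap_range assigned_traps[] = {
    {  0,   0},     /* V_BKPT                          */
    { 15,  15},     /* V_TRACE                         */
    { 64,  77},     /* V_SPILL .. V_GDB_BKPT           */
    { 80,  98},     /* Ada and pSOS traps .. V_OSBOOT  */
    {140, 159},     /* instruction emulation           */
    {220, 252},     /* INTR3 overload support          */
    {255, 255}      /* V_RESET                         */
};

/* Returns 1 if n is a software trap number (64..255) that does not
 * appear in the table, 0 otherwise. */
int trap_is_unassigned(int n)
{
    size_t i;

    if (n < 64 || n > 255)
        return 0;
    for (i = 0; i < sizeof assigned_traps / sizeof assigned_traps[0]; i++) {
        if (n >= assigned_traps[i].first && n <= assigned_traps[i].last)
            return 0;
    }
    return 1;
}
```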
Appendix D


DebugCore  2.0  Specification 


D.1  INTRODUCTION 

MiniMON29K is a collection of software components which can be used to
debug embedded 29K code. MiniMON29K is not intended to be a standalone monitor
but requires the support of an intelligent host computer. Running on the host, the
MonTIP program communicates with the DebugCore software running on the target
29K processor. Figure D-1 shows the 29K code modules of an example system using
the DebugCore in conjunction with the OS-boot operating system. The DebugCore
can operate in conjunction with any 29K operating system. This appendix explains
the operation of the DebugCore and its attendant Message System and Configuration
modules.

The use of the DebugCore is an inexpensive way to develop application or
operating system software. Most AMD demonstration and evaluation boards contain a
version  of  MiniMON29K  modules  and  OS-boot  in  on-board  ROM.  OS-boot  is  a 
simple  operating  system  which  satisfies  the  HIF  system  calls  made  by  application 
code  linked  to  the  standard  libraries. 

There  is  no  ASCII  interface  to  the  DebugCore.  All  communication  is  via  binary 
formatted  message  packets  which  are  sent  between  MonTIP  and  the  DebugCore. 

MonTIP  is  a  UDI  conformant  Target  Interface  Process  (TIP)  which  runs  on  a 
host  processor  such  as  a  SUN  workstation  or  IBM  PC.  The  TIP  operation  is  directed 
by  the  Debugger  Front  End  (DFE)  process  which  communicates  with  the  actual  user. 
The  DFE  and  TIP  processes  communicate  via  an  Inter-Process  Communication 
(IPC) mechanism which implements the Universal Debugger Interface (UDI)
protocol.
Figure D-1.  29K Target Software Module Configuration


The  MiniMON29K  tool  bundle  includes  the  MonDFE  user  interface.  Other 
popular DFEs are XRAY, GDB, SDB and CDB. The MonDFE user command “ver”
can  be  used  to  determine  the  version  numbers  of  all  the  MiniMON29K  components 
in  use. 

D.2  REGISTER  USAGE 

The  DebugCore  does  not  require  any  processor  registers  to  be  reserved  for  its 
use.  Any  registers  temporarily  used  by  the  DebugCore  or  the  message  system  will  be 
restored. This means debugging operating system or application code does not
affect register availability.

The  approach  taken  with  the  DebugCore  is  to  not  acquire  or  reserve  processor 
resources  unless  directed  to  by  the  user.  For  this  reason  the  DebugCore  does  not 
install its own interrupt and trap handler entry points into the Vector Table. The
initializing operating system is required to do this.

D.3  DEBUGCORE  1.0  ENHANCEMENTS 

The DebugCore 1.0 offered limited instruction breakpoint capabilities.
Breakpoints could not be set on a per-task basis unless on-chip Instruction Break
Control (IBC) registers were used. The DebugCore 2.0 has been enhanced to improve its
operation when used in conjunction with a complex operating system.

The DebugCore 2.0 became available after August 1993; previously the original
DebugCore 1.0 was widely used. The new DebugCore will work with all versions
of MonTIP which were intended to operate with DebugCore 1.0. When linking the
new DebugCore 2.0 with an operating system, such as OS-boot 1.0, it will be
necessary to add code to the operating system to correctly initialize vector table
entry 71. When this is done, the operating system previously integrated with
DebugCore 1.0 will operate with the new DebugCore 2.0. Note, only the latest version
of OS-boot performs the necessary Vector Table initialization (see section 7.3.4).

There were two major additions made to DebugCore 2.0. The first enables
MonTIP to request an Operating System (OS) service function to be run. The second
extends the breakpoint capability. These additions were achieved by using Vector Table
entry 71 to point to a data structure shared by the OS and the DebugCore 2.0, see
Figure D-2 below.

To  make  use  of  the  new  DebugCore  2.0  services,  MonTIP  has  been  enhanced  to 
create  version  3.0.  MonTIP  versions  prior  to  this  release  will  work  with  DebugCore 
2.0  only  if  the  OS  data  fields  of  the  shared  data  structure  are  initialized  to  zero  by  the 
29K  operating  system. 


D.3.1  Executing  OS  Service  Functions 

It is the 29K operating system’s responsibility to initialize its members of the
shared data structure pointed to by Vector Table entry 71. The DebugCore fills in the
remaining fields of the data structure when dbg_control() is called. The first entry in
the table is the address of the message system function, msg_send(). This is useful
when the OS-boot operating system is overlaid with a new operating system, the
replacement OS being downloaded via the message system. The address of the
message send function was linked with the original OS-boot code. The new OS can
obtain the address by examining the shared memory structure. This enables the OS to
continue using the message system.

DebugCore 2.0 supports a command known as IEXEC. The IEXEC message
sent by MonTIP provides an index and five parameters to the DebugCore. The index
is an offset into the shared data structure. It is used to obtain the address of a function
which will be called and passed the five related parameters. The minimum value for
the index, when accessing OS specific functions, is 21. Lower index values can be
used, but they relate to shared data structure addresses whose function has been
predefined by the DebugCore.

Functions called by the IEXEC or EXEC message need not always have five
incoming parameters; they may be defined to have fewer without any difficulties. The
first 16 words of return information generated by the function invoked by the
IEXEC command are returned to MonTIP. Most functions are likely to have a single word
or no return data.
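
The IEXEC lookup described above can be sketched in C. The sketch makes several
simplifying assumptions which are not taken from the DebugCore source: the shared
data structure is modelled as an array of function pointers, each function returns a
single word rather than up to 16, and all the names used (iexec_fn, iexec_dispatch,
iexec_is_os_slot, sample_sum) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define IEXEC_MIN_OS_INDEX      21   /* first OS specific slot, per the text */
#define IEXEC_PARAM_COUNT        5   /* parameters carried by the message    */
#define IEXEC_MAX_RETURN_WORDS  16   /* return words passed back to MonTIP   */

typedef uint32_t (*iexec_fn)(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t);

/* Lower slots are predefined by the DebugCore; higher ones belong to the OS. */
int iexec_is_os_slot(size_t index)
{
    return index >= IEXEC_MIN_OS_INDEX;
}

/* Call the function found at table[index] with the five message parameters.
 * Returns 0 on success, -1 if the slot is out of range or empty. */
int iexec_dispatch(const iexec_fn *table, size_t entries, size_t index,
                   const uint32_t params[IEXEC_PARAM_COUNT], uint32_t *result)
{
    if (index >= entries || table[index] == NULL)
        return -1;
    *result = table[index](params[0], params[1], params[2],
                           params[3], params[4]);
    return 0;
}

/* Sample OS service, used only to exercise the model. */
uint32_t sample_sum(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t e)
{
    return a + b + c + d + e;
}
```

A service that needs fewer than five parameters would simply ignore the trailing
ones, in keeping with the remark that functions may be defined with fewer
parameters.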
Figure D-2.  Data Structure Shared by Operating System and DebugCore 2.0


D.3.2  Per-Process  Breakpoints 

The DebugCore 1.0 set instruction breakpoints at physical memory addresses.
The memory must be writable as the original instruction is temporarily replaced with
an illegal opcode instruction. When a processor with Breakpoint Control (IBC)
registers is used, breakpoints can be at physical or virtual addresses. The IBC registers
are set for physical or virtual operation depending on operating system supplied
information; this is explained in detail below. DebugCore 2.0 allows the Break-PID
(BPID) to be specified when the breakpoint is requested by MonTIP.

D.3.3  Current  PID 

The 29K assumes the current PID is zero whenever the processor is operating in
Supervisor mode. However, many operating systems run separate tasks in
Supervisor mode and still wish to support per-process breakpoints. When using virtual
memory addressing, the 29K maintains the current PID in a field of the Memory
Management Unit (MMU) register. However some processors, such as the
Am29200, do not support an MMU and have no such control register.

DebugCore  2.0  takes  the  current  PID  from  the  PID_p  field  of  the  shared  data 
structure.  Field  PID_p  contains  the  physical  address  of  a  32-bit  memory  location 
which  identifies  the  current  PID.  If  the  field  is  zero  then  the  MMU  register  is  used  to 
obtain  the  current  PID.  When  the  PID_p  option  is  used,  it  is  likely  that  breakpoints 
will  be  located  at  physical  addresses.  However,  this  restriction  is  not  imposed  by  the 
DebugCore. 

An  operating  system  can  now  run  multiple  tasks  each  executing  in  Supervisor 
mode.  When  a  task  switch  occurs,  the  OS  must  update  the  32-bit  memory  location 
used  to  maintain  the  current  PID. 
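
The selection of the current PID can be sketched in C as below. The sketch is not
DebugCore source. It assumes that the PID occupies the low eight bits of the MMU
register, models the physical address held in PID_p as an ordinary pointer, and uses
the hypothetical names current_pid and MMU_PID_MASK.

```c
#include <stddef.h>
#include <stdint.h>

#define MMU_PID_MASK 0xFFu   /* assumption: PID is the low 8 bits of MMU */

/* pid_p models the PID_p field: NULL stands for a zero field, otherwise
 * it points at the 32-bit location the OS updates on each task switch. */
uint32_t current_pid(const uint32_t *pid_p, uint32_t mmu_reg)
{
    if (pid_p != NULL)
        return *pid_p;               /* OS maintained PID                 */
    return mmu_reg & MMU_PID_MASK;   /* fall back to the MMU register     */
}
```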

D.3.4  Virtual  or  Physical  Breakpoints 

OS-Boot  typically  runs  application  programs  in  User  mode  with  address 
translation  turned  on.  This  is  supported  by  DebugCore  1.0  because  the  virtual  to 
physical  address  translation  is  one-to-one. 

When instruction breakpoints are established with DebugCore 2.0 the shared
data structure is used to determine if the breakpoint is at a physical or virtual address.
If the BPID is greater than 0 (User mode) then field user_CPS is tested. This field
should be initialized by the OS to contain a copy of the CPS register value in use when
execution is in User mode. If the specified BPID is 0 then field super_CPS is tested.
These fields are tested to determine the equivalent PI-bit setting. When the PI-bit is
set in the CPS register, physical addressing is used. Virtual breakpoints are rejected if
there are no available IBC or DBC registers.

Some regions of operating system code may run in Supervisor mode with
alternately virtual and physical addressing. For example, Freeze mode code always runs
with physical addressing. In this case DebugCore 1.0 type set-breakpoint messages
must be used to establish breakpoints at physically addressed locations.

There is always some danger in mixing DebugCore 1.0 breakpoints with
DebugCore 2.0 breakpoints, as 1.0 type breakpoints will match against any identical
address value, regardless of the current PID and the PI-bit values.
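
The decision between a physical and a virtual breakpoint can be written as a short
C function. This is a sketch, not DebugCore source: the position of the PI-bit within
the CPS register (bit 5) is an assumption, and the names CPS_PI, enum bp_kind and
classify_breakpoint are hypothetical.

```c
#include <stdint.h>

#define CPS_PI (1u << 5)   /* assumption: PI-bit position in the CPS register */

enum bp_kind { BP_PHYSICAL, BP_VIRTUAL, BP_REJECTED };

/* bpid              Break-PID supplied with the set-breakpoint message
 * user_cps          user_CPS field of the shared data structure
 * super_cps         super_CPS field of the shared data structure
 * free_control_regs number of unused breakpoint control registers */
enum bp_kind classify_breakpoint(uint32_t bpid, uint32_t user_cps,
                                 uint32_t super_cps, int free_control_regs)
{
    /* BPID > 0 selects the User mode copy of the CPS value. */
    uint32_t cps = (bpid > 0) ? user_cps : super_cps;

    if (cps & CPS_PI)
        return BP_PHYSICAL;          /* PI-bit set: physical addressing */

    /* A virtual breakpoint needs a free control register. */
    return (free_control_regs > 0) ? BP_VIRTUAL : BP_REJECTED;
}
```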
D.3.5  Breakpoint Functions


The set-breakpoint message can specify the physical address of a function to
call when the ‘other’ breakpoint conditions have been met. If no function is selected
by the set-breakpoint message then the break_func field of the shared data structure
supplies the break function address. If the function address supplied by the
set-breakpoint message is zero, or if the break_func field is zero then no function will
be called.

The break function is called with two parameters: break_func(address, PID). If
the function returns a non-zero value then the breakpoint is taken, otherwise the
‘other’ qualifying conditions are updated and execution continues.

With  multi-tasking  operating  systems  the  break_func  is  a  useful  way  for  the 
DebugCore  to  inform  the  OS  that  a  task  has  hit  a  breakpoint.  The  OS  may  then  decide 
to  block  the  task  from  further  execution. 
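
The break function protocol can be sketched in C. The sketch rests on one reading of
the rules above, which should be treated as an assumption: a zero address in the
set-breakpoint message means that no function was selected, so the break_func field
is consulted, and when that is also zero the breakpoint is taken without calling
anything. The names break_fn, breakpoint_taken and stop_only_pid3 are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef int (*break_fn)(uint32_t address, uint32_t pid);

/* from_message: function address carried by the set-breakpoint message
 * from_shared:  break_func field of the shared data structure
 * Returns 1 if the breakpoint is taken, 0 if execution continues. */
int breakpoint_taken(break_fn from_message, break_fn from_shared,
                     uint32_t address, uint32_t pid)
{
    break_fn fn = (from_message != NULL) ? from_message : from_shared;

    if (fn == NULL)
        return 1;                    /* no function to consult */
    return fn(address, pid) != 0;    /* non-zero return: take the breakpoint */
}

/* Sample break function: an OS might stop only one particular task. */
int stop_only_pid3(uint32_t address, uint32_t pid)
{
    (void)address;
    return pid == 3;
}
```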


D.4  MODULE  INTERCONNECTION 

The following sections describe the mechanisms used to connect the various
29K modules. The methods and symbols used with the DebugCore 1.0 are
maintained with DebugCore 2.0; the only addition is the introduction of the shared
data structure pointed to by Vector Table entry 71.

Procedure calls made between modules comply with the AMD 29K calling
convention. Briefly this means parameters are passed in local registers starting with lr2,
lr3, etc; and return values are passed in global registers gr96-gr111. Where a
procedure deviates from this rule the differences are described.

A symbol such as dbg_errno will appear as _dbg_errno in the module symbol
table and must be accessed as _dbg_errno in assembly level code modules. Symbols
which are only ever accessed by assembly level code, such as vector table entry
dbg_V_msg, are not preceded by the ‘_’ character.

The symbols in each module are preceded by a 2 or 3 character prefix which identifies
their source module. For example all symbols in the DBG module are prefixed by
dbg_.

D.4.1  The  DebugCore  2.0 

During the design of the DebugCore it was intended that the code module would
be supplied in a linkable form. There would be no requirement to compile the source
code for the DebugCore module. Linkable image files dbg.o and dbg_core.o
implement the DebugCore services. Most of the code was written in C for file dbg_core.o;
with file dbg.o supplying the necessary assembly code glue routines. Figure D-3
shows the module interconnection symbols.

Figure D-3.  DebugCore 2.0 Module


The following subsections describe the operation of each of the interface
symbols. There is no change from the DebugCore 1.0 operation.

■  return_struct dbg_control(int dbg_trap_num, int* os_info_p)

This  is  the  only  function  entry  point  to  the  DebugCore.  The  operating  system 
calls  this  function  after  it  has  called  msg_init()  and  completed  cold-start  initializing 
of  the  processor.  Cold-start  initialization  is  defined  to  be  the  one-time  preparations 
performed  from  processor  reset  until  dbg_control()  is  called. 

The function is passed two parameters. The first is a copy of the value held in
memory location dbg_trap_num. This is required to prevent context saving when a
coredump has already been performed. When dbg_control() is called, the current
processor context is saved just as if a breakpoint had occurred. For this reason,
dbg_control() should be called with traps enabled. The context is restored when an
INIT message is received by the DebugCore. The return address of dbg_control() is
known as the beginning of OS warm-start code. This code performs the OS
initializations required before each new program’s execution. An INIT message is
normally received after a new program has been installed in 29K memory.

When dbg_control() is called, execution of 29K OS or application code is
halted and a HALT message is sent to MonTIP. This can be prevented if the first
parameter passed to dbg_control() is set to V_NOBRK (254); in such case
dbg_control() returns immediately after initializing DebugCore operation.

The  second  parameter  passed  is  a  pointer  to  a  data  structure  prepared  by  the  OS. 
The  structure  is  shown  in  Figure  D-4  below  and  describes  available  board  resources. 
The  DebugCore  does  not  require  this  information  but  usually  passes  it  to  MonTIP 
which  makes  it  available  to  the  debug  tool  user.  The  prefixes  ‘r_’,  ‘d_’  and  ‘i_’  refer 
to  ROM,  data  and  instruction  memory  spaces.  With  3-bus  29K  processors,  they  can 
all  be  different.  With  2-bus  processors,  they  are  the  same,  and  the  data  values  should 
be repeated. Similarly, microcontrollers have only one off-chip address space.
However, the memory-space fields can be used to define the separate region sizes (‘d_’
and  ‘i_’  used  together  for  DRAM,  and  ‘r_’  for  the  ROM  region). 

            OS_version       higher address
            Am29027_prl
            r_mem_size
            r_mem_start
            d_mem_size
            d_mem_start
            i_mem_size
    lr2 --> i_mem_start

Figure D-4.  OS Information Passed to dbg_control()


When  dbg_control()  returns  it  provides  the  OS  with  information  taken  from  the 
INIT  message  about  the  loaded  program.  Figure  D-5  shows  the  format  of  the  return 
data. 

The operating system can use the return information to prepare for program
execution. This process is known as OS warm-start code execution.


    gr105    this register always 0
    gr104    Operating System control info.
    gr103    start of command line args (argv)
    gr102    register stack size
    gr101    memory stack size
    gr100    first instruction of user loaded code
    gr99     end address of program data
    gr98     start address of program data
    gr97     end address of program text
    gr96     start address of program text

Figure D-5.  Return Structure from dbg_control()
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■  dbgjret 

When a 29K processor issues an IRET instruction the Old Processor Status
(OPS) register is copied to the Current Processor Status (CPS) register. Single
stepping is controlled with the Trace Enable (TE) and Trace Pending (TP)
bits in the CPS register. These two bits are maintained by the DebugCore and not the
operating system. When the OS initially wishes to issue an IRET it must make a jump
to address dbg_iret instead. The DebugCore will then update the trace control bits
and issue an IRET on behalf of the OS. Note that only the initial IRET from the
operating system to application code needs to be replaced by a jump to dbg_iret.

■  dbg_coredump 

The DebugCore supports coredumping on processor reset as an option. When a
reset occurs execution continues from address 0. A jump to dbg_coredump can be
placed at this location. When the DebugCore is entered in this way the context of the
processor is saved, variable dbg_trap_num is set to V_RESET (255), and execution
continues at label os_cold_start.

Execution of OS cold-start code causes dbg_control() to be called; but, because
the V_RESET option is selected, the context is not saved again when reentering the
DebugCore. A HALT message will be sent to MonTIP and the user can examine the
saved coredump state. Normal operation is restarted by sending a RESET message to
the DebugCore. This causes the dbg_trap_num variable to be set to zero, and
execution to start at address os_cold_start.

■  _dbg_shadow_os 

When  the  DebugCore  gains  control  of  the  processor  it  copies  the  register  values 
to  shadow  memory  locations.  All  access  to  registers  is  then  normally  performed  to 
corresponding  memory  locations.  When  32-bit  memory  location  dbg_shadow_os  is 
set  to  zero,  global  registers  gr64-gr95  are  not  shadowed  but  accessed  directly.  This 
enables  interrupt  handlers,  which  run  in  the  context  of  the  DebugCore,  to  make 
changes  to  OS-space  registers  which  will  not  be  over-written  when  the  DebugCore 
restores  context. 

The DebugCore initializes dbg_shadow_os to -1 and the alternative option is
rarely used. However, setting it to zero would be best accomplished in function
cfg_core_enter().

■ _dbg_shadow_timer

The DebugCore will not shadow the timer control registers TMR and TMC if
this 32-bit variable is set to zero. It is initialized to -1 by the DebugCore and is
typically never modified. The timer is normally disabled by code in the
cfg_core_enter() function. If the timer were enabled it would be necessary to clear
dbg_shadow_timer to prevent the timer control registers being wrongly updated when
the DebugCore restores context.

■ dbg_V_bkpt

Most  29K  processors  do  not  have  Instruction  Break  Control  (IBC)  registers. 
Consequently  the  DebugCore  implements  breakpoints  by  temporarily  placing  illegal 
opcode instructions at the corresponding RAM memory locations. Note that when IBC
registers are available they are used first, before illegal opcode instructions are
attempted. The operating system must place the DebugCore entry address dbg_V_bkpt
into vector number 0 of the processor vector table.

■  dbg_V_trace 

The DebugCore utilizes the processor Trace mechanism to control single
stepping of instructions. The operating system must place the DebugCore address
dbg_V_trace in vector number 15 of the processor vector table.
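The two vector assignments described above can be pictured with a small C model.
This is only a sketch: on a real system the DebugCore entry points are assembly
labels and the table holds their addresses, whereas here they are empty stand-in
functions. The names vector_table, handler_t and os_install_debug_vectors(), and
the table size of 256 entries, are assumptions for illustration.

```c
#define VECTOR_COUNT 256   /* assumed table size for this sketch */
#define V_BKPT       0     /* vector 0: illegal opcode, used for breakpoints */
#define V_TRACE      15    /* vector 15: Trace trap, used for single stepping */

typedef void (*handler_t)(void);

/* Stand-ins for the DebugCore entry points (really assembly labels). */
void dbg_V_bkpt(void)  {}
void dbg_V_trace(void) {}

/* Model of the processor vector table; unused entries stay zero. */
handler_t vector_table[VECTOR_COUNT];

/* What the OS does during cold-start to give the DebugCore its traps. */
void os_install_debug_vectors(void)
{
    vector_table[V_BKPT]  = dbg_V_bkpt;
    vector_table[V_TRACE] = dbg_V_trace;
}
```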

■  dbg_trap 

There are many entries in the processor vector table for which the operating
system has no appropriate action. For example, trap number 5 (Protection Violation)
could result in the OS issuing a HALT instruction. It is much better to direct
these vectors to the DebugCore entry dbg_trap. When this happens a HALT message
with the associated trap number is sent to MonTIP. The DFE can then report an access
violation to the user.

Before the jump to dbg_trap is taken, the OS must place the trap number in
register gr64 or memory location dbg_trap_num.

The DebugCore 1.0 offered limited instruction breakpoint capabilities.
Breakpoints could not be set on a per-task basis unless Instruction Break Control
(IBC) registers were used. The DebugCore 2.0 has been enhanced to improve its
operation when used in conjunction with a multi-tasking operating system.

■ dbg_m_trap

Certain 29K processors support IBC registers and Monitor mode debugging.
Monitor mode is entered when a trap is taken while the DA bit is set. When the
processor enters Monitor mode it starts execution at address 16. The OS must place at
this address a jump to DebugCore entry dbg_m_trap.

When  a  29K  processor  takes  a  Warn  trap  it  also  starts  execution  at  address  16. 
For  this  reason  the  DebugCore  returns  control  to  the  OS  at  label  os_warn  if 
dbg_m_trap  is  wrongly  entered  as  a  result  of  a  WARN  trap. 

■ dbg_V_msg

A new message may arrive at any time and if it is for the DebugCore then vector
V_DBG_MSG (75) is taken. The OS is required to install the handler address
dbg_V_msg in the processor vector table. The message system identifies DebugCore
messages by their class number: a class number less than 64 identifies a DebugCore
message. Message classes 64 and greater are sent to the operating system.

■ os_cold_start

This is the starting address of operating system cold-start code. For processors
which do not use the dbg_coredump option it is likely placed at address 0.
Additionally, if the processor supports Monitor mode then cold-start code is likely
to be placed at address 24, just after the jump to dbg_m_trap. The DebugCore causes
execution to start at the address given by the os_cold_start field of the shared
memory structure when a RESET message is received. However, if this field value is
zero then execution continues at linked address label os_cold_start.

■  os_warn 

This  is  the  label  at  the  start  of  OS  code  dealing  with  a  Warn  trap.  For  processors 
without  Monitor  mode  it  should  be  at  address  16,  otherwise  some  higher  address.  The 
DebugCore causes execution to start at the address given by the os_warn field of the
shared memory structure when the processor WARN pin is activated. However, if this
field value is zero then execution continues at linked address label os_warn.

■  os_V_illop 

The DebugCore is given control of the illegal opcode trap by the operating
system. However, a real illegal opcode trap may occur and the DebugCore must then
return control to the OS. This is achieved by entering the OS at address os_V_illop
supplied by the shared memory structure; if this field value is zero, execution
continues at linked address label os_V_illop. The OS may at this point return
control to the DebugCore at label dbg_trap. If this is done then gr64 should not be
set to zero, as this will cause the DebugCore and OS to enter an endless loop.
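The same fallback rule applies to os_cold_start, os_warn and os_V_illop: a
non-zero field in the shared memory structure takes precedence, otherwise the
address resolved at link time is used. A minimal sketch of that rule follows;
the function name resolve_os_entry() and the use of plain 32-bit integers for
addresses are assumptions for illustration.

```c
#include <stdint.h>

/* Select an OS entry address: the shared memory structure field if it
 * is non-zero, otherwise the address the label had at link time. */
uint32_t resolve_os_entry(uint32_t shared_field, uint32_t linked_label)
{
    return shared_field != 0 ? shared_field : linked_label;
}
```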

D.4.2 The Message System 1.0

The message system provides a means by which the operating system and the
DebugCore can communicate with MonTIP and ultimately the DFE. Communication
with MonTIP is via message packets. Each packet has a header of two 32-bit words.
The first word specifies the message Class and the second the message Length in
bytes. Note that the length does not include the 8 bytes of the header. File
dbg_core.h defines the AMD reserved message Class numbers (numbers less than 128).
Figure D-6 lists the messages received by the DebugCore. There are corresponding
messages sent by the DebugCore.
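The packet header can be sketched in C as follows. The structure name
msg_header, its field names and the helper msg_packet_bytes() are hypothetical;
the actual declarations in the DebugCore sources may differ. The sketch only
captures the two-word header and the rule that Length excludes the header.

```c
#include <stdint.h>

#define MSG_HEADER_BYTES 8u   /* two 32-bit words: Class and Length */

/* Hypothetical header layout for this sketch. */
struct msg_header {
    uint32_t msg_class;   /* message Class number */
    uint32_t length;      /* body length in bytes, header excluded */
};

/* Total number of bytes transferred for one packet. */
uint32_t msg_packet_bytes(const struct msg_header *h)
{
    return MSG_HEADER_BYTES + h->length;
}
```

For example, a packet whose Length word holds 12 occupies 20 bytes in total.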

The operating system uses the message system to request HIF services (for
service numbers less than 256) and for standard input and output channel data.

Figure D-7 shows the module interconnection symbols. The module interacts
with the DebugCore, the OS and any additional communications module, such as the
serial driver module (SER). When the message system is used with hardware such as
PC plug-in boards, which have shared memory, there is no need for an additional
communications driver module.

    RESET        Reset processor
    CONFIG_REQ   Request configuration info.
    STATUS_REQ   Request status info.
    READ_REQ     Read memory or registers
    WRITE_REQ    Write memory or registers
    SET_BKPT     Set a breakpoint
    RM_BKPT      Remove a breakpoint
    BKPT_STAT    Enquire breakpoint status
    COPY         Copy data between mem./reg.
    FILL         Fill mem./reg. with pattern
    INIT         Prepare to execute a program
    GO           Start or continue execution
    STEP         Single or multi-step execution
    BREAK        Stop execution
    FIND_REQ     Find pattern in mem./reg.
    CRC_REQ      Check range of memory
    EXEC_REQ     Execute a function
    IEXEC_REQ    Execute a function from table

Figure D-6. DebugCore 2.0 Receive Messages


    Interface symbols:
        msg_init()    msg_V_msg     os_V_msg          msg_wait_for()    msg_rbuf_p
        ser_init()    ser_send()    ser_wait_for()    msg_V_arrive      msg_next_p

Figure D-7. Message System 1.0 Module


■  void  msg_init() 

The operating system must call msg_init() before sending any messages to
MonTIP. This is normally done during OS cold-start. Additionally, the OS must call
msg_init() before it calls dbg_control(). This enables the DebugCore to start using
the message system.

■ msg_V_msg

This  label  (called  msg_intr  in  older  code  versions)  is  the  entry  address  for  the 
interrupt  handler  dealing  with  message  reception.  If  the  29K  can  be  interrupted  when 
a  complete  message  is  available,  such  as  when  a  message  is  placed  in  shared  memory 
by  another  processor,  the  OS  should  place  address  msg_V_msg  in  the  appropriate 
vector  table  entry. 

■  struct  message  *msg_rbuf_p 

The  message  system  passes  received  messages  via  this  pointer.  For  example, 
when  the  DebugCore  accesses  a  new  incoming  message  it  uses  msg_rbuf_p  to  point 
to  the  start  of  the  message. 

■  struct  message  *msg_sbuf_p 

The msg_send() function sets the msg_sbuf_p variable to point to the start of
the buffer currently being sent by the 29K processor. The message support
mechanism must ensure that this pointer is set to zero after the current message has
been received by the TIP on the host processor. For message systems which rely on
UARTs, the UART driver must ensure that msg_sbuf_p is set to zero when the last
character of the current message has been sent.

■ int msg_send(struct message *message_pointer)

This function is used to send all messages to MonTIP. Both the OS and
DebugCore use this function. It is passed a pointer to the start of the message and
returns the number of characters accepted for transmission. If the message system is
currently too busy to accept the message, the function returns -1.
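The interplay between msg_send() and msg_sbuf_p can be modelled on a host
machine as below. This is a mock written for illustration, not the DebugCore
implementation: the layout of struct message, the returned count (header plus
body) and the function mock_driver_send_complete() are assumptions. It shows
only the busy rule, that a send is refused with -1 while msg_sbuf_p is non-zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message layout: the two header words only. */
struct message {
    uint32_t msg_class;
    uint32_t length;      /* body bytes, header excluded */
};

/* Non-NULL while a message is still being sent. */
struct message *msg_sbuf_p = NULL;

/* Mock msg_send(): -1 when busy, else the character count accepted
 * (assumed here to be the 8-byte header plus the body). */
int msg_send(struct message *message_pointer)
{
    if (msg_sbuf_p != NULL)
        return -1;        /* previous message not yet fully sent */
    msg_sbuf_p = message_pointer;
    return (int)(8u + message_pointer->length);
}

/* What a UART driver does once the last character has been sent. */
void mock_driver_send_complete(void)
{
    msg_sbuf_p = NULL;
}
```

A caller that receives -1 would normally retry once the driver has cleared
msg_sbuf_p.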

■  int  msg_wait_for() 

The  DebugCore  can  receive  a  message  at  any  time  if  interrupts  are  enabled. 
However, when the DebugCore is waiting for a message it calls msg_wait_for(). This
function  returns  zero  immediately  if  no  message  is  available  and  the  DebugCore 
must  wait  for  a  message  system  interrupt.  If  the  message  system  is  controlling  the 
communications  hardware  via  polling  (with  interrupts  turned  off),  then 
msg_wait_for()  will  return  -1  when  a  complete  new  message  is  received. 

■ os_V_msg

When the message system operates with interrupts enabled a message may be
received at any time from MonTIP. If the message class is 64 or greater the operating
system will be interrupted. The OS must install the entry address os_V_msg for
vector number V_OS_MSG (76) in the processor vector table.
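The routing of an incoming message to either the DebugCore or the OS depends
only on its class number. A minimal sketch of that decision follows; the
function name msg_vector_for_class() is hypothetical, while the vector numbers
75 and 76 and the boundary at class 64 are those given in the text.

```c
#include <stdint.h>

#define V_DBG_MSG 75   /* vector taken for DebugCore messages */
#define V_OS_MSG  76   /* vector taken for operating system messages */

/* Classes below 64 belong to the DebugCore; 64 and above go to the OS. */
int msg_vector_for_class(uint32_t msg_class)
{
    return msg_class < 64u ? V_DBG_MSG : V_OS_MSG;
}
```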

■  void  ser_init() 

When  the  message  system  is  initialized  with  a  call  to  msg_init(),  it  will  call  the 
initialization  function  for  any  hardware  driver  module.  In  the  case  of  the  serial  driver 
module the function ser_init() is called.

■ ser_send(char* buf_p, int nchar)

When the message system is not supported by hardware which can communicate
whole messages, such as a dual port memory system, it calls a hardware driver
such as ser_send() to send a message via the selected hardware.

■  int  ser_wait_for() 

This  function  operates  in  conjunction  with  msg_wait_for().  The  return  value  is 
-1  if  a  message  is  available.  With  a  poll  mode  driver,  this  function  should  not  return 
until  a  new  message  has  been  received. 

■  msg_V_arrive 

When communications hardware is receiving a message character-at-a-time,
each new character causes the message module to be interrupted via entry point
msg_V_arrive. For example, the serial driver jumps to this label after completing
any hardware support tasks and restoring any saved interrupt context. The code at
label msg_V_arrive determines if a complete message has arrived and generates an
os_V_msg or dbg_V_msg interrupt if required; otherwise it issues an IRET
instruction.

■  char*  msg_next_p 

This pointer records the current position in the message receive buffer. As each
character is received by the communications hardware, and an interrupt results in the
hardware driver passing control to msg_V_arrive, msg_next_p identifies the
position at which to save the received character. It is incremented each time
msg_V_arrive is entered.
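The character-at-a-time reception just described can be modelled in C as below.
The real msg_V_arrive is an interrupt entry point, so this is only a host-side
sketch. The buffer msg_rbuf, its size, and the functions msg_arrive_char() and
msg_rbuf_reset() are assumptions, as is the premise that the header words are
stored in the byte order of the machine running the code.

```c
#include <stdint.h>
#include <string.h>

#define MSG_HEADER_BYTES 8u
#define MSG_BUF_BYTES    256u   /* assumed buffer size for this sketch */

unsigned char msg_rbuf[MSG_BUF_BYTES];   /* hypothetical receive buffer */
char *msg_next_p = (char *)msg_rbuf;     /* where the next character goes */

/* Store one received character. Returns 1 when a complete message
 * (header plus Length body bytes) is in the buffer, otherwise 0. */
int msg_arrive_char(char c)
{
    uint32_t have, length;

    if (msg_next_p == (char *)msg_rbuf + MSG_BUF_BYTES)
        return 0;                 /* buffer full: character dropped */
    *msg_next_p++ = c;
    have = (uint32_t)(msg_next_p - (char *)msg_rbuf);
    if (have < MSG_HEADER_BYTES)
        return 0;                 /* header not yet complete */
    memcpy(&length, msg_rbuf + 4, sizeof length);   /* second header word */
    return have - MSG_HEADER_BYTES >= length;
}

/* Prepare the buffer for the next message. */
void msg_rbuf_reset(void)
{
    msg_next_p = (char *)msg_rbuf;
}
```

When msg_arrive_char() reports a complete message, the real code would go on to
raise the os_V_msg or dbg_V_msg interrupt according to the message class.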


D.4.3  The  DebugCore  2.0  Configuration 

Although AMD makes the source to the DebugCore module available, it is
intended that users only require the DebugCore in a linkable form. The DebugCore
does not access off-chip resources directly; this is left to the configuration module.
This module is supplied in source and enables simple configuration of off-chip
accesses made by the DebugCore. Figure D-8 shows the symbol interface to the module.

The configuration module contains code for all the popular 29K configurations.
It is usually only necessary to assemble the module with the correct assembler
directive to produce a configuration for most 29K platforms. In rare cases it may be
necessary to modify the source code.

    CFG (DebugCore configuration):
        cfg_peek()
        cfg_poke()
        cfg_core_enter()
        dbg_errno

Figure D-8. Configuration Module

For example, if a 29K system had some special FIFO hardware which had to be
accessed in a restricted way, the peek and poke functions would need modification to
deal with the necessary protocol.

■ void cfg_peek(to, from, count, space, size)

All reads of off-chip resources are performed by this function. The to address
identifies a DebugCore data region which is to receive the data. The count parameter
refers to the number of objects of size size (measured in characters). The space
parameter identifies the address space for the source data at address from; this
parameter is mostly used by 3-bus processors.

■ void cfg_poke(to, from, count, space, size)

All writes to off-chip resources are performed by this function. The data is
supplied from a DebugCore data region pointed to by the from parameter.
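A much simplified peek and poke pair is sketched below for a system with a
single address space. The parameter types, the restriction of size to 1, 2 or 4
characters, the error value 1 and the helper cfg_copy() are assumptions; the
text gives only the parameter names. A real port would perform accesses of the
requested width and honour the space parameter rather than call memcpy().

```c
#include <stddef.h>
#include <string.h>

int dbg_errno = 0;   /* set non-zero when a peek or poke fails */

/* Shared helper for this sketch; returns 0 on success. */
static int cfg_copy(void *to, const void *from, int count, int size)
{
    if (count < 0 || (size != 1 && size != 2 && size != 4))
        return 1;                 /* hypothetical error code */
    memcpy(to, from, (size_t)count * (size_t)size);
    return 0;
}

/* Read count objects of size characters from off-chip address from. */
void cfg_peek(void *to, const void *from, int count, int space, int size)
{
    (void)space;                  /* one address space assumed */
    dbg_errno = cfg_copy(to, from, count, size);
}

/* Write count objects of size characters to off-chip address to. */
void cfg_poke(void *to, const void *from, int count, int space, int size)
{
    (void)space;                  /* one address space assumed */
    dbg_errno = cfg_copy(to, from, count, size);
}
```

Special hardware such as the FIFO mentioned earlier would replace the memcpy()
call with whatever access protocol the device requires.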

■  void  cfg_core_enter() 

This function is called when the DebugCore gains control for any reason (other
than the call to dbg_control()). It is called after the DebugCore has saved context and
requested a HALT message be sent to MonTIP. It is called before the DebugCore
starts processing any received message from MonTIP.

It enables the environment in which the DebugCore runs to be modified. For
example, cfg_core_enter() normally turns interrupts on. If it is desired that
interrupts should remain off during DebugCore operation, then the TD and DI bits
should be set in this function.

■  int  dbg_errno 

This  32-bit  memory  location  is  used  by  the  configuration  module  to  indicate  an 
error with a requested peek or poke operation.